Labs

Lab 1: Basics

Data Types

To begin any statistical analysis, the first step is to download a dataset onto your computer. Datasets come in various formats, each identified by a specific file extension. Common examples include CSV files (.csv), Text files (.txt), Excel spreadsheets (.xlsx), and Stata data files (.dta). Knowing the format of your dataset is crucial, as it determines how you will import, manipulate, and analyze the data in statistical software.

Now, skim through the following Wikipedia pages on the four data formats: (1) text file: https://en.wikipedia.org/wiki/Text_file; (2) CSV file: https://en.wikipedia.org/wiki/Comma-separated_values; (3) Excel spreadsheets: https://en.wikipedia.org/wiki/Microsoft_Excel; (4) Stata data file: https://en.wikipedia.org/wiki/Stata.

For Mac users, to identify the file type, you can check this YouTube video: https://www.youtube.com/watch?v=wyDFG176ofs. For Windows users, follow these steps: (1) left-click on the file; (2) right-click on the file to open a menu; (3) left-click on “Properties”; (4) in the window that appears, look for the section labeled “Type of file.”

In this course, we will primarily work with datasets in three formats: CSV files, Excel spreadsheets, and Stata data files.

In today’s precept, we will demonstrate how to import datasets in CSV format and Stata data files into RStudio. I have uploaded three datasets to Canvas: pol346-precept01-us-gini.csv (CSV format), pol346-precept01-sweden.csv (CSV format), and pol346-precept01-un-population.dta (Stata data file). Download all three datasets to your computer. You also need to download the R Markdown file for this precept, POL346_Precept1_YF.Rmd, to your computer.

Create and Organize Your Folder

Now, create a folder on your computer for the precept sessions. On my computer, I created a folder named precept to store all of the precept materials. For guidance on creating folders, check out these two YouTube videos: (1) for Mac users: https://www.youtube.com/watch?v=fIPtINwnpo4; (2) for Windows users: https://www.youtube.com/watch?v=Amd6V-ERLO8.

Now, create a subfolder named precept01 within the precept folder. This precept01 folder will be used to store the R Markdown file and datasets for the first precept session.

Moving Files for Precept 01 to the precept01 Folder

With the two folders ready, we need to move the three datasets and the R Markdown file for this precept into the relevant folders.

The three datasets and the R Markdown file should be in the Downloads folder on your computer. If you have trouble finding the Downloads folder, check out these two YouTube videos: (1) for Mac users: https://www.youtube.com/watch?v=_tBx1TyU2UM; (2) for Windows users: https://www.youtube.com/watch?v=KTBxzrM1Z50.

If the three datasets and the R Markdown file are not in your Downloads folder, you need to locate them on your computer. For guidance on finding a specific file’s location, check out these two YouTube videos: (1) for Mac users: https://www.youtube.com/watch?v=PHuKkqymFE8; (2) for Windows users: https://www.youtube.com/watch?v=l5UnnETOyOc.

Once you have located the three datasets and the R Markdown file, move them to the precept01 folder. For guidance on moving files to a specific folder, check out these two YouTube videos: (1) for Mac users: https://www.youtube.com/watch?v=w1izlhicjvQ; (2) for Windows users: https://www.youtube.com/watch?v=Iz8kmeslVkk.

Working Directory

Let’s first try using the getwd() function in R. The output from this line of code is displayed below.

getwd()
## [1] "/Users/yf5068/Library/CloudStorage/Dropbox/*Princeton/2025_Summer/Book346/labs"

Essentially, it shows that the current working directory for RStudio is the folder where the R Markdown file is located. The full path starts under /Users/yf5068 and descends, folder by folder, into the labs folder inside Dropbox, so you can see exactly how I organize my files. The / symbol indicates the hierarchical relationship between folders: each name after a / sits inside the folder before it. Organizing files consistently in this way is important for reading data and saving output.
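To see how / separates folder levels, you can build a path yourself with the file.path() function; the folder names below are placeholders, not real folders on my computer:

```r
# build a hypothetical path, one folder level per argument
file.path("Users", "me", "precept", "precept01")
## [1] "Users/me/precept/precept01"
```

file.path() inserts the separator for you, which avoids typos like a missing / between folder names.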

You can also check out the following YouTube videos on file paths on your computer: (1) for Mac users: https://www.youtube.com/watch?v=3TAEC-1YUZw; (2) for Windows users: https://www.youtube.com/watch?v=BMT3JUWmqYY.

Reading a CSV File into RStudio

In RStudio, we will now use the read.csv() function to load the US Gini coefficient data. To make the read.csv() function work, we need to provide the exact file path of the CSV file. Additionally, we will save the dataset into the RStudio environment under the name us_gini.

In our first try, we give read.csv() just the file name, without the extension, and we see an error message. Any guess why? The reason is that the extension is part of the file name: the file on disk is named pol346-precept01-us-gini.csv, so R cannot find a file named pol346-precept01-us-gini. We need to add .csv at the end of the file name.
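The failing first attempt would look like the sketch below (commented out so knitting does not stop at the error):

```r
# first try: file name without the .csv extension
# us_gini <- read.csv("data/pol346-precept01-us-gini")
# Error in file(file, "rt") : cannot open the connection
```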

Now, let us add .csv in the code to tell RStudio that pol346-precept01-us-gini is a CSV file. It works! Let us also use the head() function to see the first few rows of the data.

us_gini <- read.csv("data/pol346-precept01-us-gini.csv")
head(us_gini)
##   X year  gini
## 1 1 1947 0.376
## 2 2 1948 0.371
## 3 3 1949 0.378
## 4 4 1950 0.379
## 5 5 1951 0.363
## 6 6 1952 0.368

Now, let us give read.csv() both the file path and the file name and see what happens. This returns the same results as before. In a sense, the file path is redundant here, because pol346-precept01-us-gini.csv is already in the current working directory.

However, when your files and datasets are stored in different places, or you work across different devices, it is useful to save the path to wherever your datasets live. Below, I present two ways to save the path or to specify a different one.

Note that we need an additional / between the folder name and the file name in the path, to tell the computer to look for pol346-precept01-us-gini.csv inside that folder.

#save path from getwd()
my_path <- getwd() #we can save the path here.
my_path
## [1] "/Users/yf5068/Library/CloudStorage/Dropbox/*Princeton/2025_Summer/Book346/labs"
us_gini_2 <- read.csv(paste0(my_path,"/data/pol346-precept01-us-gini.csv"))
head(us_gini_2)
##   X year  gini
## 1 1 1947 0.376
## 2 2 1948 0.371
## 3 3 1949 0.378
## 4 4 1950 0.379
## 5 5 1951 0.363
## 6 6 1952 0.368

For now, let us move the CSV file pol346-precept01-sweden.csv to the precept folder. Then, let us read this CSV file.

First, try using the read.csv() function with just the file name. R says: “Error in file(file, "rt") : cannot open the connection”. This is because the file path is no longer correct: RStudio still looks in the precept-01 folder, but pol346-precept01-sweden.csv is not there anymore. Make sure you always specify the correct file path.

sweden <- read.csv("data/pol346-precept01-sweden.csv")

To fix the problem, we can manually give the file path and data name to the read.csv() function.

sweden <- read.csv(paste0(my_path,"/data/pol346-precept01-sweden.csv"))

Another way to fix the problem is to change the working directory to the precept folder using the setwd() function. Then we can simply give read.csv() the file name, pol346-precept01-sweden.csv. We will also see that we can no longer read pol346-precept01-us-gini.csv the same way, because it is not in the precept folder.
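A minimal sketch of this approach (the path below is a placeholder; substitute the location of your own precept folder):

```r
# change the working directory to the precept folder (placeholder path)
setwd("/path/to/precept")
# now the file name alone is enough
sweden2 <- read.csv("pol346-precept01-sweden.csv")
```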

sweden2 <- read.csv("data/pol346-precept01-sweden.csv")
head(sweden2)
##   country    period   age  births deaths   py.men py.women       l_x
## 1     SWE 1950-1955   0-4   0.000 13.765 1490.037 1410.502 100000.00
## 2     SWE 1950-1955   5-9   0.000  1.302 1542.698 1470.816  97575.29
## 3     SWE 1950-1955 10-14   0.000  1.216 1266.321 1217.133  97308.11
## 4     SWE 1950-1955 15-19  40.823  1.581 1078.133 1049.193  97093.74
## 5     SWE 1950-1955 20-24 141.137  2.264 1119.421 1105.129  96733.76
## 6     SWE 1950-1955 25-29 160.882  2.885 1305.003 1284.552  96223.56
us_gini_3 <- read.csv("data/pol346-precept01-us-gini.csv")

With a new chunk, we see that the working directory is still the precept-01 folder.

getwd()
## [1] "/Users/yf5068/Library/CloudStorage/Dropbox/*Princeton/2025_Summer/Book346/labs"

Reading a Stata Data File into RStudio

We will now read the Stata data file pol346-precept01-un-population.dta into RStudio. To do this, we use the read_dta() function, which is part of the user-contributed haven package. First, if you haven’t already, you need to install the haven package using install.packages(). Then, load the package (i.e., tell RStudio that we will be using this user-contributed package in our current work) by using the library() function.

# I add # to the line below to comment it out, so RStudio will not execute it every time.
# If you haven't downloaded the package yet, run this line once.
# install.packages("haven")
library(haven)

Now, let us use the read_dta() function to read the Stata data file into RStudio.

un_pop <- read_dta("data/pol346-precept01-un-population.dta")
head(un_pop)
## # A tibble: 6 × 2
##    year world_pop
##   <dbl>     <dbl>
## 1  1950     2526.
## 2  1960     3026.
## 3  1970     3691.
## 4  1980     4449.
## 5  1990     5321.
## 6  2000     6128.

Finally, let’s clean up the memory with the code below. After executing it, all objects stored in the RStudio environment will be removed.

#rm(list = ls())

I will also move pol346-precept01-sweden.csv back to the precept-01 folder, as it makes sense to keep all files used in the first precept session organized in one place.

Now, click the Knit button, and select Knit to Word. You can then convert the saved Word file to a PDF file.

For the second part, we will cover logical relationships in R and how to subset data; understanding logical relationships is a prerequisite for learning how to subset data.

Logical Operators in R

There are only two logical values, TRUE and FALSE. We will use logicals to assess whether a condition is satisfied (TRUE) or not (FALSE).

# check class of logicals
class(TRUE)
## [1] "logical"
class(FALSE)
## [1] "logical"
# check numerical values of logicals
as.integer(TRUE)
## [1] 1
as.integer(FALSE)
## [1] 0

There are three logical operators in R: (1) the AND operator “&”, which takes two logical values and returns TRUE if both values are TRUE, (2) the OR operator “|”, which takes two logical values and returns TRUE if at least one of the values is TRUE, and (3) the NOT operator “!”, which negates the logical value it is applied to.

In a loose sense, the AND operator is similar to intersection in set theory, the OR operator is similar to union, and the NOT operator is similar to complement.

# illustrate the AND operator
FALSE & FALSE
## [1] FALSE
FALSE & TRUE
## [1] FALSE
TRUE & FALSE
## [1] FALSE
TRUE & TRUE
## [1] TRUE
# illustrate the OR operator
FALSE | FALSE
## [1] FALSE
FALSE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
TRUE | TRUE
## [1] TRUE
# illustrate the NOT operator
!FALSE
## [1] TRUE
!TRUE
## [1] FALSE

Now, let us use the logical operators to evaluate some relationships between numbers.

# assign 6 to an object x
x <- 6

# assign 8 to an object y
y <- 8

# is x >= 2?
x >= 2
## [1] TRUE
# is x <= 1
x <= 1
## [1] FALSE
# is x > 0 and y < 10?
(x > 0) & (y < 10)
## [1] TRUE
# is x < 0 and y < 10?
(x < 0) & (y < 10)
## [1] FALSE
# is x > 0 or y < 10?
(x > 0) | (y < 10)
## [1] TRUE
# is x < 0 or y < 10?
(x < 0) | (y < 10)
## [1] TRUE
# is x < 0 or y < -8?
(x < 0) | (y < -8)
## [1] FALSE

In R, we can check whether an object is equal to or not equal to a specific value using comparison operators. The “==” operator checks for equality, while the “!=” operator checks for inequality.

# check whether x is equal to 6
x == 6
## [1] TRUE
# check whether x is not equal to 6
x != 6
## [1] FALSE
# check whether x is equal to 8
x == 8
## [1] FALSE
# check whether x is not equal to 8
x != 8
## [1] TRUE

Using the subset() Function to Subset Datasets

First, let’s load the dataset for this precept into RStudio and save it to a data frame. This should be familiar since we practiced it extensively in last week’s precept and problem set.

# load readxl package and read data
require(readxl)
## Loading required package: readxl
office_hour <- read_excel("data/pol345-precept02-oh.xlsx", sheet = "office-hour")

# view data
knitr::kable(head(office_hour, 20))
Name                 Role            Email          OH                              OH Day     Location    PU
Yousuf Abdelfatah    Preceptor       yaabdelfatah…  Tuesday 2:00 PM to 3:00 PM      Tuesday    Fisher B11  Yes
Huseyin Emre Ceyhun  Preceptor       heceyhun@pri…  Wednesday 4:00 PM - 5:00 PM     Wednesday  Fisher B11  Yes
Tolgahan Dilgin      Head Preceptor  dilgin@princ…  Monday 11:00 AM to 12:00 PM     Monday     Fisher 314  No
Isaiah Johnson       Preceptor       ij1791@princ…  Thursday 9:00 AM to 10:00 AM    Thursday   Fisher B11  Yes
Shourya Sen          Preceptor       shouryas@pri…  Thursday 4:00 PM to 5:00 PM     Thursday   Fisher B11  Yes
Zeyang Yu            Instructor      arthurzeyang…  Wednesday 08:50 AM to 09:50 AM  Wednesday  Fisher 306  No
# since the data only contains 6 observations, let's use head() to see the data
head(office_hour)
## # A tibble: 6 × 7
##   Name                Role           Email         OH    `OH Day` Location PU   
##   <chr>               <chr>          <chr>         <chr> <chr>    <chr>    <chr>
## 1 Yousuf Abdelfatah   Preceptor      yaabdelfatah… Tues… Tuesday  Fisher … Yes  
## 2 Huseyin Emre Ceyhun Preceptor      heceyhun@pri… Wedn… Wednesd… Fisher … Yes  
## 3 Tolgahan Dilgin     Head Preceptor dilgin@princ… Mond… Monday   Fisher … No   
## 4 Isaiah Johnson      Preceptor      ij1791@princ… Thur… Thursday Fisher … Yes  
## 5 Shourya Sen         Preceptor      shouryas@pri… Thur… Thursday Fisher … Yes  
## 6 Zeyang Yu           Instructor     arthurzeyang… Wedn… Wednesd… Fisher … No

To review, the top row of the dataset contains the following variable names: (1) Name (names of the instruction team members), (2) Role (their roles), (3) Email (their email addresses), (4) OH (office hours), (5) OH Day (the day of the week for office hours), (6) Location (office hour locations), and (7) PU (indicating if the person is a Princeton PhD student). Furthermore, the first column displays the Name variable, the second column shows the Role variable, and so on.

Now, let’s create a new dataset that only contains the preceptors using the subset() function. This dataset will be a subset of the full dataset. We can achieve this either by excluding the observation for the instructor or by keeping only the preceptors and the head preceptor. Remember to check the documentation for the subset() function here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/subset.

# method 1: subset data with the condition Role not being the instructor
office_hour_preceptor1 <- subset(office_hour, Role != "Instructor")
head(office_hour_preceptor1)
## # A tibble: 5 × 7
##   Name                Role           Email         OH    `OH Day` Location PU   
##   <chr>               <chr>          <chr>         <chr> <chr>    <chr>    <chr>
## 1 Yousuf Abdelfatah   Preceptor      yaabdelfatah… Tues… Tuesday  Fisher … Yes  
## 2 Huseyin Emre Ceyhun Preceptor      heceyhun@pri… Wedn… Wednesd… Fisher … Yes  
## 3 Tolgahan Dilgin     Head Preceptor dilgin@princ… Mond… Monday   Fisher … No   
## 4 Isaiah Johnson      Preceptor      ij1791@princ… Thur… Thursday Fisher … Yes  
## 5 Shourya Sen         Preceptor      shouryas@pri… Thur… Thursday Fisher … Yes
# method 2: subset data with the condition Role being either preceptor or head preceptor
# grepl(): https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep
office_hour_preceptor2 <- subset(office_hour, 
                                 grepl("Preceptor", Role))
head(office_hour_preceptor2)
## # A tibble: 5 × 7
##   Name                Role           Email         OH    `OH Day` Location PU   
##   <chr>               <chr>          <chr>         <chr> <chr>    <chr>    <chr>
## 1 Yousuf Abdelfatah   Preceptor      yaabdelfatah… Tues… Tuesday  Fisher … Yes  
## 2 Huseyin Emre Ceyhun Preceptor      heceyhun@pri… Wedn… Wednesd… Fisher … Yes  
## 3 Tolgahan Dilgin     Head Preceptor dilgin@princ… Mond… Monday   Fisher … No   
## 4 Isaiah Johnson      Preceptor      ij1791@princ… Thur… Thursday Fisher … Yes  
## 5 Shourya Sen         Preceptor      shouryas@pri… Thur… Thursday Fisher … Yes

Now, let’s create a new dataset that contains only the graduate preceptors using the subset() function with the logical AND operator. This can be done in one of three ways: (1) by excluding both the instructor and the head preceptor, (2) by keeping only preceptors who are also Princeton graduate students, or (3) by keeping only observations where Role is exactly “Preceptor.” We will use the first two methods to demonstrate the logical AND operator in R.

# method 1: subset data with excluding both instructor and head preceptor
office_hour_graduate1 <- subset(office_hour, 
                                Role != "Head Preceptor" & Role != "Instructor")
head(office_hour_graduate1)
## # A tibble: 4 × 7
##   Name                Role      Email              OH    `OH Day` Location PU   
##   <chr>               <chr>     <chr>              <chr> <chr>    <chr>    <chr>
## 1 Yousuf Abdelfatah   Preceptor yaabdelfatah@prin… Tues… Tuesday  Fisher … Yes  
## 2 Huseyin Emre Ceyhun Preceptor heceyhun@princeto… Wedn… Wednesd… Fisher … Yes  
## 3 Isaiah Johnson      Preceptor ij1791@princeton.… Thur… Thursday Fisher … Yes  
## 4 Shourya Sen         Preceptor shouryas@princeto… Thur… Thursday Fisher … Yes
# method 2: subset data with keeping only preceptors who are also Princeton graduate students
office_hour_graduate2 <- subset(office_hour, 
                                grepl("Preceptor", Role) & PU == "Yes")
head(office_hour_graduate2)
## # A tibble: 4 × 7
##   Name                Role      Email              OH    `OH Day` Location PU   
##   <chr>               <chr>     <chr>              <chr> <chr>    <chr>    <chr>
## 1 Yousuf Abdelfatah   Preceptor yaabdelfatah@prin… Tues… Tuesday  Fisher … Yes  
## 2 Huseyin Emre Ceyhun Preceptor heceyhun@princeto… Wedn… Wednesd… Fisher … Yes  
## 3 Isaiah Johnson      Preceptor ij1791@princeton.… Thur… Thursday Fisher … Yes  
## 4 Shourya Sen         Preceptor shouryas@princeto… Thur… Thursday Fisher … Yes
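The third method, which the precept does not run, would look like the sketch below (the object name office_hour_graduate3 is just an illustration):

```r
# method 3 (sketch): keep only rows whose Role is exactly "Preceptor",
# which excludes both the head preceptor and the instructor
office_hour_graduate3 <- subset(office_hour, Role == "Preceptor")
```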

Now, let’s create some new datasets using the subset() function with the logical OR operator. See the examples below.

# subset data: either not a PU PhD student or not holding office hours on Wednesday
office_hour_subset1 <- subset(office_hour,
                              PU == "No" | `OH Day` != "Wednesday")
head(office_hour_subset1)
## # A tibble: 5 × 7
##   Name              Role           Email           OH    `OH Day` Location PU   
##   <chr>             <chr>          <chr>           <chr> <chr>    <chr>    <chr>
## 1 Yousuf Abdelfatah Preceptor      yaabdelfatah@p… Tues… Tuesday  Fisher … Yes  
## 2 Tolgahan Dilgin   Head Preceptor dilgin@princet… Mond… Monday   Fisher … No   
## 3 Isaiah Johnson    Preceptor      ij1791@princet… Thur… Thursday Fisher … Yes  
## 4 Shourya Sen       Preceptor      shouryas@princ… Thur… Thursday Fisher … Yes  
## 5 Zeyang Yu         Instructor     arthurzeyangyu… Wedn… Wednesd… Fisher … No
# subset data: either not a PU PhD student or holding office hours on Wednesday
office_hour_subset2 <- subset(office_hour,
                              PU == "No" | `OH Day` == "Wednesday")
head(office_hour_subset2)
## # A tibble: 3 × 7
##   Name                Role           Email         OH    `OH Day` Location PU   
##   <chr>               <chr>          <chr>         <chr> <chr>    <chr>    <chr>
## 1 Huseyin Emre Ceyhun Preceptor      heceyhun@pri… Wedn… Wednesd… Fisher … Yes  
## 2 Tolgahan Dilgin     Head Preceptor dilgin@princ… Mond… Monday   Fisher … No   
## 3 Zeyang Yu           Instructor     arthurzeyang… Wedn… Wednesd… Fisher … No

Lab 2: Data Visualization

Learning Goals

Things we are going to cover in this lab:

  1. Summary statistics: mean, variance, median, conditional mean
  2. Visualizing summary statistics: box plot
  3. Visualizing probability distributions: histogram, CDF, density plot
# Quick reference sheet for R: https://www.datacamp.com/cheat-sheet/getting-started-r

First, we will talk about summary statistics in R. Let us first read the data, “afghan.csv”, into RStudio.

getwd()
## [1] "/Users/yf5068/Library/CloudStorage/Dropbox/*Princeton/2025_Summer/Book346/labs"
# read Afghan data and save it to an object, afghan
afghan <- read.csv("data/afghan.csv")

Mean, Variance, and Conditional Mean

Check Sample Size

Before starting our analysis, it is always advisable to check the sample size in the dataset. Let’s first check the number of observations in the “afghan” dataset. The result shows that there are 2754 observations in this dataset.

# check the number of rows in afghan data
nrow(afghan)
## [1] 2754

We can also (manually) examine the sample size for each variable to see how many observations are available. The results below show that no observations are missing for the age variable, while 154 observations are missing for the income variable.

# check the number of observations for age variable in afghan dataset
#is.na(afghan$age) # how to interpret the output?
sum(is.na(afghan$age))
## [1] 0
# check the number of observations for income variable in afghan dataset
sum(is.na(afghan$income))
## [1] 154

Expectation

For illustration purposes, let’s now focus on the “age” variable in the “afghan” dataset, which has no missing values.

First, we can compute the expected value (also known as the average) for this variable. We can use the mean() function for this task.

# use mean() function to calculate the average for age variable
mean(afghan$age)
## [1] 32.39143

Moreover, we can manually calculate the average: \[E(age_{i}) = \frac{\sum_{i = 1}^{n} age_{i}}{n}\]

# manually calculate the average for the age variable
sum(afghan$age)/sum(!is.na(afghan$age)) # numeric variable + no missing values
## [1] 32.39143
# sum(afghan$income) # income is a categorical variable and has missing values

Both calculations give us the same number, 32.39143, which is expected. We conclude that the average age of the observations in this dataset is 32.39143.

Conditional Expectation

Let’s consider another variable in the dataset, employed, which has no missing values (you should verify this yourself!). We want to compute the following conditional expectations: (1) the average age of respondents given that they are not employed (that is, \(E[age_{i} \mid employed_{i} = 0]\)); and (2) the average age of respondents given that they are employed (that is, \(E[age_{i} \mid employed_{i} = 1]\)). The two conditional expectations are 30.1114 and 34.02368, respectively.
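A quick way to run that missing-values check yourself, following the same pattern used for age and income above:

```r
# count missing values in the employed variable
sum(is.na(afghan$employed)) # expect 0 if there are no missing values
```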

# calculate E(age | employed = 0)
mean(subset(afghan, employed == 0)$age, na.rm = TRUE)
## [1] 30.1114
# calculate E(age | employed = 1)
mean(subset(afghan, employed == 1)$age, na.rm = TRUE)
## [1] 34.02368

Variance and Standard Deviation

We can use the var() and sd() functions to calculate the variance and standard deviation for a variable. The results show that the variance and the standard deviation for age are 151.0316 and 12.28949, respectively.

# calculate variance for age
var(afghan$age)
## [1] 151.0316
sd(afghan$age)
## [1] 12.28949

Now, let us manually calculate the variance for a variable with the two equivalent formulas: \[Var(age_{i}) = \frac{1}{n} \sum_{i = 1}^{n} \left( age_{i} - \frac{\sum_{i = 1}^{n} age_{i}}{n} \right)^{2}\] \[Var(age_{i}) = \frac{1}{n} \sum_{i = 1}^{n} age_{i}^{2} - \left(\frac{\sum_{i = 1}^{n} age_{i}}{n} \right)^{2}\] Interestingly, the manual calculation gives a slightly smaller number, 150.9768.

# use the first formula to manually calculate the variance of age
mean((afghan$age - mean(afghan$age))^2)
## [1] 150.9768
# use the second formula to manually calculate the variance of age
mean(afghan$age^2) - mean(afghan$age)^2
## [1] 150.9768

The discrepancy arises from a different formula used in the var() function. Specifically, the formula for calculating variance in the var() function is as follows: \[Var(age_{i}) = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( age_{i} - \frac{\sum_{i = 1}^{n} age_{i}}{n} \right)^{2}\] With this new formula, we can manually replicate the results obtained from the var() function. This exercise emphasizes the importance of understanding the underlying mechanisms of each function to make sense of the statistical results.

# use the formula above to calculate the variance of age
sum((afghan$age - mean(afghan$age))^2)/(nrow(afghan) - 1)
## [1] 151.0316

Median

We can use the median() function to calculate the median of a variable. The results indicate that the median age in the dataset is 30 years.

# calculate median for age
median(afghan$age)
## [1] 30

Putting Things Together

We can also use the summary() function to generate summary statistics, including the minimum, maximum, quartiles, and mean.

summary(afghan$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   22.00   30.00   32.39   40.00   80.00

Visualize Variable

Visualize Summary Statistics

We can use a box plot to visualize part of the information in the summary statistics. The box contains the middle 50% of the data, ranging from the lower quartile (25th percentile) to the upper quartile (75th percentile), with the solid horizontal line indicating the median (50th percentile). Dotted vertical lines, each ending in a short horizontal line called a “whisker,” extend below and above the box; they cover the data within 1.5 IQR (interquartile range) below the lower quartile and above the upper quartile, respectively. Observations that fall outside 1.5 IQR from the quartiles are indicated by open circles. We now draw the box plot for the age variable in the “afghan” dataset.

The box plot was invented by John Tukey, a former professor at Princeton University.

# box plot for the age variable in afghan data
boxplot(afghan$age, # data: age variable in afghan
        main = "Box Plot for Age", # title of the plot
        ylab = "Age", # label for y axis
        ylim = c(10, 80)) # range for y axis
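The quantities the box plot displays can also be computed directly with base R's quantile() and IQR() functions, as a check against the plot:

```r
# quartiles behind the box: 25th, 50th (median), and 75th percentiles
quantile(afghan$age, probs = c(0.25, 0.5, 0.75))
# interquartile range (the height of the box); should match summary(): 40 - 22
IQR(afghan$age)
```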

Visualize Probability Distribution

A histogram can display the probability mass function of a discrete random variable. To illustrate this, we will use the age variable from the Afghan dataset, which ranges from 15 to 80 and consists entirely of integer values. The figure shows that the majority of the observations are under 50 years old.

#install.packages("ggplot2") # install ggplot2 package if you have not done so
library(ggplot2)
#Quick Guide to ggplot2: 
#https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf
#https://r-statistics.co/ggplot2-cheatsheet.html
# histogram for the age variable in afghan data
hist(afghan$age, # data: age variable in afghan
     breaks = seq(from = 15, to = 80, by = 1), # each bin ranges 1 year
     freq = FALSE, # plot probability densities 
     ylim = c(0, 0.06), # specify the range for y axis
     xlab = "Age", # specify label for x axis
     main = "Histogram of Respondent’s Age") # specify the title

ggplot(afghan, aes(x = age)) +
  # histogram with 1-year bins, probability density on y-axis
  geom_histogram(
    aes(y = after_stat(density)),   # use density instead of counts (like freq = FALSE)
    breaks = seq(15, 80, by = 1),   # bin edges: 15, 16, ..., 80
    color = "black",                # outline color for bars
    fill = "steelblue"              # fill color for bars
  ) +
  # set y-axis range
  scale_y_continuous(limits = c(0, 0.06)) +
  # axis labels and title
  labs(
    x = "Age", 
    y = "Density", 
    title = "Histogram of Respondent’s Age"
  ) 

We can also plot the CDF for the age variable with the ecdf() function.

# plot CDF for the age variable
plot(ecdf(afghan$age),
     xlab = "Age",
     ylab = "CDF",
     main = "CDF for Respondents' Age")

ggplot(afghan, aes(x = age)) +
  # empirical cumulative distribution function (ECDF)
  stat_ecdf(geom = "step", color = "steelblue", linewidth = 1) +
  # axis labels and title
  labs(
    x = "Age",
    y = "CDF",
    title = "CDF for Respondents' Age"
  ) 

We can also plot the probability density of a continuous random variable, which is essentially a smoothed version of a histogram. Although age is a discrete variable, we will use it here to demonstrate how to create a density plot.

# produce a kernel density plot for age
plot(density(afghan$age),
     xlab = "Age",
     ylab = "Density",
     main = "Density for Respondents' Age")

ggplot(afghan, aes(x = age)) +
  # kernel density curve
  geom_density(
    fill = "steelblue",   # fill color under the curve
    alpha = 0.4,          # transparency of the fill
    color = "black",      # outline color of the curve
    linewidth = 1         # line thickness
  ) +
  # axis labels and title
  labs(
    x = "Age",
    y = "Density",
    title = "Density for Respondents' Age"
  )

Plot age by income group, separately for each province.

ggplot(afghan, aes(x = age, y = income)) +
  # draw boxplots of income by age
  geom_boxplot(outlier.alpha = 0.4, fill = "steelblue") +
  # separate plots by province
  facet_wrap(~ province) +
  # axis labels and title
  labs(
    x = "Age",
    y = "Income",
    title = "Income by Age for Each Province"
  ) +
  theme_minimal(base_size = 14) +
  # rotate x labels if too many ages
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#What's wrong with the output?
#The variable income is not correctly ordered. 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# recode income into a factor with a custom level order
afghan <- afghan %>%
  mutate(
    income = factor(
      income,
      levels = c(
        "less than 2,000",
        "2,001-10,000",
        "10,001-20,000",
        "20,001-30,000",
        "over 30,000"
      ) # note: factor() drops NA from the levels by default; missing values stay NA
    )
  )

ggplot(afghan, aes(x = age, y = income)) +
  # draw boxplots of income by age
  geom_boxplot(outlier.alpha = 0.4, fill = "steelblue") +
  # separate plots by province
  facet_wrap(~ province) +
  # flip coordinates so age is on the vertical axis
  coord_flip() +
  # axis labels and title
  labs(
    x = "Age",
    y = "Income",
    title = "Income by Age for Each Province"
  ) +
  theme_minimal(base_size = 14) +
  # rotate x labels if too many ages
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Remember: the purpose of a plot is to convey as much information as possible in the simplest way.

Lab 3: Simulation and Sampling

Learning Goals

Things we are going to cover:

  1. Sampling with and without replacement
  2. Simulating random variables
  3. Visualizing distributions
rm(list = ls()) # clear the environment
library(ggplot2)
library(dplyr)

Sampling

The set.seed() function ensures that the random numbers generated can be reproduced: if you run the same code with the same seed, you will get the same results every time.
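A quick demonstration that the same seed reproduces the same draws:

```r
set.seed(42)
a <- sample(1:10, 3)
set.seed(42) # reset to the same seed
b <- sample(1:10, 3)
identical(a, b) # TRUE: same seed, same draws
```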

Replacement

  • With replacement: each time you draw a sample, you put it back into the pool before drawing the next one. This means the same item can be selected multiple times.
  • Without replacement: once an item is drawn, it is not put back into the pool. This means each item can be selected at most once.
sample(1:10, 5,replace=TRUE) # sampling with replacement
## [1]  9  3 10  6  1
sample(1:10, 5,replace=TRUE) # sampling with replacement
## [1] 2 1 6 5 5
sample(1:10, 5,replace=FALSE) # sampling without replacement
## [1]  1  8  6  7 10
sample(1:10, 5,replace=FALSE) # sampling without replacement
## [1] 6 3 1 5 9
#What do you notice here?
#If you want to get the same results every time you run the code, you can set the seed using set.seed() function.
set.seed(123)
sample(1:10, 5,replace=TRUE) # sampling with replacement
## [1]  3  3 10  2  6
sample(1:10, 5,replace=FALSE) # sampling without replacement
## [1] 5 4 6 8 1
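One more difference worth noticing: with replacement, you can draw more values than the pool contains; without replacement, you cannot. A quick sketch:

```r
# With replacement, the sample can be larger than the pool
# (with 11 draws from 10 values, at least one repeat is guaranteed):
sample(1:10, 11, replace = TRUE)

# Without replacement this is impossible, and R throws an error:
# sample(1:10, 11, replace = FALSE)
# Error: cannot take a sample larger than the population when 'replace = FALSE'
```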

Take the example from the lecture: suppose jury selection is truly random, with Black men making up 26 of every 100 people in the population. Let’s simulate that process and compare the results to what was actually observed: among the 100 selected potential jurors, there were only 8 Black men.

set.seed(123) # for reproducibility

# parameters
pool_size  <- 100 # size of jury pool
p_true     <- 26/100 #the true probability of being black in the population
obs_black  <- 8 # the number of black men from the selected jury pool

simulate one draw of jury pool and compare to observed

# simulate one draw of jury pool
sim_counts <- rbinom(1, size = pool_size, prob = p_true) #rbinom() function generates random numbers following a binomial distribution
sim_counts
## [1] 23
# compare to observed
sim_counts <= obs_black
## [1] FALSE

simulate 10 draws of jury pool and visualize the distribution

# simulate 10 draws of jury pool
n_trials <- 10 # number of simulations
sim_counts <- rbinom(n_trials, size = pool_size, prob = p_true) #rbinom() function generates random numbers following a binomial distribution
sim_counts
##  [1] 29 25 31 33 19 26 31 26 25 34
# visualize the distribution
df <- data.frame(sim_counts = sim_counts)
ggplot(df,aes(x=sim_counts))+
  geom_histogram(binwidth=1, fill="lightblue", color="black")+
  geom_vline(xintercept = obs_black, color = "red", linetype = "dashed", size = 1) 

simulate 100 draws of jury pool and visualize the distribution

# simulate 100 draws of jury pool
n_trials <- 100 # number of simulations
sim_counts <- rbinom(n_trials, size = pool_size, prob = p_true) #rbinom() function generates random numbers following a binomial distribution
sim_counts
##   [1] 25 28 27 21 32 23 19 24 34 31 28 28 37 28 28 26 27 24 21 34 32 28 30 18 26
##  [26] 29 23 24 23 21 25 25 24 22 21 23 26 23 31 19 25 30 21 27 22 21 29 32 25 28
##  [51] 20 25 23 30 25 30 30 30 25 29 27 28 13 26 23 25 27 24 21 23 28 25 29 21 25
##  [76] 36 31 31 22 21 28 24 28 24 22 29 20 26 26 27 24 26 34 26 31 32 27 25 21 33
# visualize the distribution
df <- data.frame(sim_counts = sim_counts)
ggplot(df,aes(x=sim_counts))+
  geom_histogram(binwidth=1, fill="lightblue", color="black")+
  geom_vline(xintercept = obs_black, color = "red", linetype = "dashed", size = 1) 

simulate 1000 draws of jury pool and visualize the distribution

# simulate 1000 draws of jury pool
n_trials <- 1000 # number of simulations
sim_counts <- rbinom(n_trials, size = pool_size, prob = p_true) #rbinom() function generates random numbers following a binomial distribution
#sim_counts 
# visualize the distribution
df <- data.frame(sim_counts = sim_counts)
ggplot(df,aes(x=sim_counts))+
  geom_histogram(binwidth=1, fill="lightblue", color="black")+
  geom_vline(xintercept = obs_black, color = "red", linetype = "dashed", size = 1) 

# different simulation sizes
n_trials_list <- c(10, 100, 1000, 10000) # number of simulations

# run simulations for each n_trials
sim_results <- lapply(n_trials_list, function(n_trials) { #lapply() function applies a function over a list or vector
  sim_counts <- rbinom(n_trials, size = pool_size, prob = p_true) #rbinom() function generates random numbers following a binomial distribution
  data.frame(
    sim_counts = sim_counts, #save the simulated counts
    n_trials   = n_trials  #save the number of simulations
  )
})

# combine into one data frame
df <- do.call(rbind, sim_results)

ggplot(df, aes(x = sim_counts)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +  # counts; a density-only version follows below
  geom_vline(xintercept = obs_black, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = p_true * pool_size, color = "blue", linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Black Men in Jury Pool (Simulated)",
    subtitle = "Red = observed (8), Blue = expected (26)",
    x = "Number of Black Men in Pool (out of 100)",
    y = "Frequency"
  ) +
  facet_wrap(~ n_trials, scales = "free_y")

#distribution plot
ggplot(df, aes(x = sim_counts)) +
  # density line from simulation
  geom_density(adjust = 1, fill = "lightblue", alpha = 0.4) +
  # observed and expected lines
  geom_vline(xintercept = obs_black, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = p_true * pool_size, color = "blue", linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Black Men in Jury Pool (Simulated)",
    subtitle = "Red = observed (8), Blue = expected (26)",
    x = "Number of Black Men in Pool (out of 100)",
    y = "Density"
  ) +
  facet_wrap(~ n_trials, scales = "free_y")

Calculate the probability of having exactly 8, and of having 8 or fewer, Black men in the selected jury pool under the null hypothesis (p = 0.26).

# parameters
n <- 100
p <- 0.26
k <- 8

# exact probability
prob_exact <- dbinom(k, size = n, prob = p)
prob_exact
## [1] 3.620955e-06
prob_less_equal <- pbinom(k, size = n, prob = p) # cumulative probability P(X <= 8)
prob_less_equal
## [1] 4.734795e-06
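As a sanity check, pbinom(k, n, p) is just the sum of the individual probabilities dbinom(0, n, p) through dbinom(k, n, p):

```r
n <- 100
p <- 0.26
k <- 8

# P(X <= 8) from the CDF...
p_cdf <- pbinom(k, size = n, prob = p)
# ...equals the sum of P(X = 0), P(X = 1), ..., P(X = 8)
p_sum <- sum(dbinom(0:k, size = n, prob = p))

all.equal(p_cdf, p_sum) # TRUE
```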

Roll the Dice

The key takeaway from this simulation is a visual demonstration of the Central Limit Theorem (CLT):

  • With few dice (e.g., 2 or 8), the distribution of sums is jagged and skewed, reflecting the discrete nature of dice rolls.

  • As the number of dice increases, the distribution of the sum becomes smoother and more symmetric.

  • By the time you get to dozens of dice (e.g., 20, 50), the distribution of the sum looks very close to a bell-shaped normal distribution.

Intuition:

  • Each die is uniform(1–6).

  • The sum of independent uniform random variables tends toward normality as the number of variables increases.

  • This is exactly the Central Limit Theorem in action: the distribution of properly scaled sums of independent random variables converges to a normal distribution, regardless of the underlying distribution.
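We can also check the claim analytically before simulating. A single fair die has mean 3.5 and variance 35/12, and for a sum of n independent dice both scale by n. A quick sketch comparing these to simulated values (using 20 dice and 10,000 rolls, matching the setup below):

```r
set.seed(123)

mu_die  <- mean(1:6)               # 3.5
var_die <- mean((1:6 - mu_die)^2)  # 35/12, about 2.917 (population variance of one die)

n_dice <- 20
sums <- replicate(10000, sum(sample(1:6, n_dice, replace = TRUE)))

mean(sums) # close to 20 * 3.5  = 70
var(sums)  # close to 20 * 35/12, about 58.3
```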

set.seed(123)

# parameters
n_sides  <- 6        # number of sides on each die
n_rolls  <- 10000    # number of rolls
dice_list <- seq(2, 50, by = 6)  # 2, 8, 14, ..., 50 dice

# simulate rolls for each number of dice
sim_results <- lapply(dice_list, function(n_dice) {
  sum_of_rolls <- replicate(n_rolls, sum(sample(1:n_sides, n_dice, replace = TRUE)))
  data.frame(
    sum_of_rolls = sum_of_rolls,
    n_dice = n_dice
  )
})

# combine into one data frame
df_rolls <- bind_rows(sim_results)

# histogram facetted by number of dice
ggplot(df_rolls, aes(x = sum_of_rolls)) +
  geom_histogram(binwidth = 1, fill = "lightgreen", color = "black") +
  labs(
    title = "Distribution of Sum of Dice Rolls",
    x = "Sum of Dice",
    y = "Frequency"
  ) +
  facet_wrap(~ n_dice, scales = "free")

# density facetted by number of dice
ggplot(df_rolls, aes(x = sum_of_rolls)) +
  geom_density(fill = "lightgreen", alpha = 0.4) +
  labs(
    title = "Density of Sum of Dice Rolls",
    x = "Sum of Dice",
    y = "Density"
  ) +
  facet_wrap(~ n_dice, scales = "free") 

Lab 4: Hypothesis Testing

Learning Goals

Things we are going to cover:

  1. Hypothesis Testing
  2. Permutation
  3. Bootstrapping
  4. Correlation
library(ggplot2)
library(dplyr)

Permutation Test

  • The permutation test is a non-parametric method for hypothesis testing.
  • It is used to determine whether two or more groups are significantly different from each other.
  • It is based on the idea of rearranging (shuffling) the data to create a null distribution of the test statistic.
# Load the chickwts dataset (built into R; contains chick weights by different feed types)
data(chickwts)

# Keep only the rows where feed is "casein" or "soybean"
chickwts <- chickwts %>% 
  filter(feed %in% c("casein", "soybean"))

# Extract the weights of chicks fed with casein
feed1 <- chickwts$weight[chickwts$feed == "casein"]
# Extract the weights of chicks fed with soybean
feed2 <- chickwts$weight[chickwts$feed == "soybean"]

# Create a violin/box/jitter plot of chick weights by feed type
ggplot(chickwts, aes(x = feed, y = weight, fill = feed)) +   # set feed as x-axis, weight as y-axis, and color fill by feed
  geom_violin(trim = FALSE, alpha = 0.3) +                   # violin plot shows distribution shape; trim=FALSE keeps tails
  geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.5) + # boxplot shows median & quartiles, narrower width, hide outliers
  geom_jitter(width = 0.1, alpha = 0.6) +                    # add jittered raw points for individual weights
  labs(title = "Distribution of Chick Weights by Feed Type", # add title and axis labels
       x = "Feed Type", y = "Weight") +
  theme_minimal()                                            # use a clean minimal theme

# Calculate the observed difference in mean weights between casein and soybean feeds
observed_diff <- mean(feed1) - mean(feed2)

# Print the observed mean difference
observed_diff
## [1] 77.15476
set.seed(123)
# sample(x, size) draws size elements from x (or from the integers 1:x if x is a single number); calling sample(x) with no size returns a random permutation of x.
x <- 1:12
# a random permutation
sample(x) # sample with the size of x (length(x)) from the vector x.
##  [1]  3 12 10  2  6 11  5  4  9  8  1  7
# bootstrap resampling -- only if length(x) > 1 !
sample(x, replace = TRUE)
##  [1] 11  5  3 11  9 12  9  9  3  8 10  7
set.seed(123)
# Combine and shuffle once
combined_data <- sample(c(feed1, feed2)) #Sampling from feed1 and feed2 with the size of length(feed1) + length(feed2)
combined_data #let's see what we get as a sampling result
##  [1] 248 193 230 379 222 250 283 404 171 158 199 271 216 332 316 243 327 359 390
## [20] 352 248 368 318 267 260 329
# Split into two permuted groups of the same sizes as original
permuted_feed1 <- combined_data[1:length(feed1)] #we assign the first chunk back to "feed1" (same size as original feed1)
permuted_feed2 <- combined_data[(length(feed1) + 1):length(combined_data)] #we assign the remaining chunk back to "feed2" (same size as original feed2)

#The idea is to randomly shuffle the combined data and then split it back into two groups of the same sizes as the original groups, which is to reassign labels for sampling output.

# Calculate the difference in means for this single shuffle
single_diff <- mean(permuted_feed1) - mean(permuted_feed2)

print(paste("One permutation difference:", single_diff))
## [1] "One permutation difference: -58.2619047619048"
# Set a random seed for reproducibility (so results are the same each run)
set.seed(123)

# Number of permutations to perform (larger = more precise p-value estimate)
n_permutations <- 10000  

# Create an empty numeric vector to store permuted differences in means
permutation_diffs <- numeric(n_permutations)

# Loop over the number of permutations
for (i in 1:n_permutations) {
  # Combine the two groups (casein and soybean) and randomly shuffle their order
  combined_data <- sample(c(feed1, feed2))
  
  # Assign the first chunk back to "feed1" (same size as original feed1)
  permuted_feed1 <- combined_data[1:length(feed1)]
  
  # Assign the remaining chunk back to "feed2" (same size as original feed2)
  permuted_feed2 <- combined_data[(length(feed1) + 1):length(combined_data)]
  
  # Compute the difference in means between the two permuted groups
  permutation_diffs[i] <- mean(permuted_feed1) - mean(permuted_feed2)
}

# Plot the null distribution of permuted mean differences
ggplot(data.frame(diff = permutation_diffs), aes(x = diff)) +
  geom_histogram(bins = 40, fill = "lightblue", color = "white") +  # histogram of permuted differences
  geom_vline(xintercept = observed_diff, color = "red", size = 1) + # vertical line at observed difference
  labs(title = "Permutation Test",
       x = "Difference in Means (Casein - Soybean)",
       y = "Count") +
  theme_minimal()                                                   # clean plot style

# Compute the permutation p-value:
# proportion of permuted differences at least as extreme as the observed difference
p_value <- sum(abs(permutation_diffs) >= abs(observed_diff)) / n_permutations

# Print the observed mean difference between casein and soybean
print(paste("Observed difference:", observed_diff))
## [1] "Observed difference: 77.1547619047619"
# Print the permutation test p-value
print(paste("P-value:", p_value))
## [1] "P-value: 0.0034"
# Interpretation of the test result based on 5% significance level
if (p_value < 0.05) {
  print("Reject the null hypothesis: There is a significant difference in mean weights between casein and soybean feeds.")
} else {
  print("Fail to reject the null hypothesis: No significant difference in mean weights between casein and soybean feeds.")
}
## [1] "Reject the null hypothesis: There is a significant difference in mean weights between casein and soybean feeds."
#The idea of permutation test is that if the null hypothesis is correct, then the labels (casein vs soybean) are interchangeable. So we shuffle the labels and see how often we get a difference in means as extreme as the observed one.

t.test(feed1, feed2) #let's see what we get from t.test
## 
##  Welch Two Sample t-test
## 
## data:  feed1 and feed2
## t = 3.2743, df = 21.635, p-value = 0.003521
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   28.23823 126.07129
## sample estimates:
## mean of x mean of y 
##  323.5833  246.4286
#The results from permutation test and t.test are quite similar in this case.

Bootstrap

The bootstrap is a resampling technique used to estimate statistics (such as means, variances, and confidence intervals) from a sample by repeatedly resampling it with replacement.
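Before applying it to real data, here is a minimal sketch of the idea on a toy vector: resample with replacement many times, compute the statistic of interest (here the mean) each time, and take percentiles of the resampled statistics as a confidence interval.

```r
set.seed(123)
x <- c(4, 8, 15, 16, 23, 42) # a small toy sample

B <- 5000 # number of bootstrap resamples
boot_means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

mean(x)                               # the original sample mean
quantile(boot_means, c(0.025, 0.975)) # 95% percentile bootstrap CI for the mean
```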

okcupid <- read.csv("data/okcupid_profiles.csv")
dim(okcupid) #let's see the dimension of the dataset 
## [1] 59946    31
# There are 31 columns/variables and 59946 rows in the dataset. This is a very rich dataset for you to explore.

Income Gender Gap: female and male income difference

#First, let's see how many observations we have for each income level to get a basic sense of the data in terms of range and distribution.
okcupid %>% # %>% is the pipe operator, which passes the left-hand side as the first argument to the function on the right-hand side.
  group_by(sex,income) %>% #this means I want to group the data by variable sex and income.
  summarise(count = n()) # I want to count the number of observations for each group.
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 26 × 3
## # Groups:   sex [2]
##    sex   income count
##    <chr>  <int> <int>
##  1 f         -1 21004
##  2 f      20000  1031
##  3 f      30000   320
##  4 f      40000   339
##  5 f      50000   311
##  6 f      60000   221
##  7 f      70000   166
##  8 f      80000   261
##  9 f     100000   256
## 10 f     150000    75
## # ℹ 16 more rows
# What do you notice about the output?
# What about the -1 value for income?

# Now let's visualize the income distribution by sex
ggplot(okcupid %>% 
         filter(income>0), 
       aes(x = income, group=factor(sex), color=factor(sex))) +
  geom_density() + # density plot to show income distribution
  labs(title = "Income Distribution by sex",
       x = "Income",
       y = "Density",
       color="Sex") 

ggplot(okcupid %>% 
         filter(income>0), 
       aes(x = income, group=factor(sex), color=factor(sex))) +
  geom_density() + # density plot to show income distribution
  xlim(0, 260000) + # limit x-axis to focus on typical income range
  labs(title = "Income Distribution by sex",
       x = "Income",
       y = "Density",
       color="Sex") 
## Warning: Removed 569 rows containing non-finite outside the scale range
## (`stat_density()`).

# Calculate the sample mean income by sex
okcupid %>% 
  filter(income>0) %>% 
  group_by(sex) %>% 
  summarise(mean_income = mean(income), 
            median_income = median(income),
            count = n())
## # A tibble: 2 × 4
##   sex   mean_income median_income count
##   <chr>       <dbl>         <int> <int>
## 1 f          86633.         40000  3113
## 2 m         110984.         60000  8391
income = okcupid %>% 
  filter(income>0) 

# Calculate the observed difference in mean income between
obs_diff_tab <- income %>% 
  group_by(sex) %>% 
  summarise(mu = mean(income)) %>% 
  arrange(sex) 
# Set seed so results are reproducible (randomization will give same results each run)
set.seed(123)

# Stratified bootstrap (resample within each sex; preserve group structure)

B <- 10000                       # Set number of bootstrap replications (10,000 samples)
boot_diffs <- numeric(B)         # Pre-allocate an empty numeric vector to store differences from each bootstrap sample

# Split the dataset into two groups based on sex (e.g., female vs. male)
split_by_sex <- split(income, income$sex)

# For each sex we sample n incomes with replacement
#n <- 1000
n <- 10000
# Note: n can be larger than the number of observations in each group,  
# since bootstrap sampling is done WITH replacement. 
# This means some individuals may be selected multiple times, while others may not be selected at all.

# Start the bootstrap loop: repeat the resampling B times
for (b in seq_len(B)) {
  # Sample incomes for group 1 (e.g., female) with replacement, size = n
  samp_f <- sample(split_by_sex[[1]]$income, size = n, replace = TRUE)
  
  # Sample incomes for group 2 (e.g., male) with replacement, size = n
  samp_m <- sample(split_by_sex[[2]]$income, size = n, replace = TRUE)
  
  # Calculate the difference in mean income for this bootstrap replicate
  # (female mean - male mean; adjust order if you prefer male - female)
  boot_diffs[b] <- mean(samp_f) - mean(samp_m)
}

# Compute a 95% confidence interval from the bootstrap distribution
# Use the 2.5th and 97.5th percentiles of the bootstrap differences
ci <- quantile(boot_diffs, c(0.025, 0.975), na.rm = TRUE)

# Compute an approximate two-sided bootstrap p-value
# This is the proportion of bootstrap samples that produce a difference 
# as extreme or more extreme than 0 (the null hypothesis value),
# multiplied by 2 to make it two-sided.
p_boot <- 2 * min(mean(boot_diffs >= 0, na.rm = TRUE),
                  mean(boot_diffs <= 0, na.rm = TRUE))

# Results: display observed difference, confidence interval, and bootstrap p-value
# Observed difference in sample means (from the original dataset)
obs_diff = obs_diff_tab$mu[1] - obs_diff_tab$mu[2]

ci        # prints the 95% bootstrap confidence interval
##      2.5%     97.5% 
## -29818.08 -18925.88
p_boot    # prints the bootstrap p-value
## [1] 0
# Plot the bootstrap distribution with multiple reference lines
ggplot(data.frame(diff = boot_diffs), aes(x = diff)) +
  geom_histogram(fill = "lightblue") +                                   # histogram of bootstrap differences
  geom_vline(xintercept = mean(boot_diffs),                              # green line at mean of bootstrap distribution
             linetype = "solid", color = "green") +
  geom_vline(xintercept = ci,                                           # blue dashed lines for 95% CI bounds
             linetype = "dashed", color = "blue") +
  geom_vline(xintercept = obs_diff, color = "red", size = 1) +          # red line for observed difference
  geom_vline(xintercept = 0, linetype = "dotted", color = "black") +    # black dotted line at 0 (null hypothesis)
  labs(title = "Bootstrap Distribution of Mean Income Difference",
       x = "Difference in Mean Income")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# Compare results with a classical two-sample t-test
# This tests the null hypothesis that mean incomes are equal between sexes
t.test(income$income ~ income$sex)
## 
##  Welch Two Sample t-test
## 
## data:  income$income by income$sex
## t = -5.9762, df = 5974.6, p-value = 2.416e-09
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -32338.69 -16363.14
## sample estimates:
## mean in group f mean in group m 
##        86633.47       110984.39

Variance, Covariance, and Correlation

The variance of a random variable \(x\) is defined as \[ \mathrm{Var}(x) = \mathbb{E}\big[(x - \mu_x)^2\big], \] where \(\mu_x = \mathbb{E}[x]\) is the mean of \(x\).

The covariance between two random variables \(x\) and \(y\) is \[ \mathrm{Cov}(x, y) = \mathbb{E}\big[(x - \mu_x)(y - \mu_y)\big], \] where \(\mu_x = \mathbb{E}[x]\) and \(\mu_y = \mathbb{E}[y]\).

The correlation coefficient between \(x\) and \(y\) is \[ \mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}. \]


Computation in R

  • If x is a vector, then:
    • var(x) computes the variance of x.
    • cov(x, y) computes the covariance between x and y.
    • cor(x, y) computes the correlation between x and y.
  • If x and y are matrices, then:
    • cov(x, y) returns the covariance matrix between the columns of x and the columns of y.
    • cor(x, y) returns the correlation matrix between the columns of x and the columns of y.
# Example
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Variance of x
var(x)
## [1] 2.5
# Covariance and Correlation between x and y
cov(x, y)
## [1] 5
cor(x, y)
## [1] 1
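Note that var(), cov(), and cor() use the sample (n - 1) versions of the formulas above. A quick verification by hand, using the same vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
n <- length(x)

var_manual <- sum((x - mean(x))^2) / (n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
var_y      <- sum((y - mean(y))^2) / (n - 1)
cor_manual <- cov_manual / sqrt(var_manual * var_y)

all.equal(var_manual, var(x))    # TRUE (2.5)
all.equal(cov_manual, cov(x, y)) # TRUE (5)
all.equal(cor_manual, cor(x, y)) # TRUE (1)
```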
# Matrix case
X <- matrix(1:9, ncol = 3)
Y <- matrix(2:10, ncol = 3)

cov(X, Y)
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
cor(X, Y)
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
cor(okcupid$age, okcupid$income)
## [1] -0.00100384

Lab 5: Linear Regression

library(ggplot2)
library(dplyr)
library(readxl)

world <- read.csv("data/world_data.csv")

dplyr Basics

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate(): Adds new variables that are functions of existing variables.
  • select(): Picks variables based on their names.
  • filter(): Picks cases (rows) based on their values.
  • summarise(): Reduces multiple values down to a single summary statistic.
  • arrange(): Changes the ordering of the rows.
starwars %>% 
  filter(species == "Droid") # filter rows when species is Droid
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
starwars %>% 
  select(name, ends_with("color")) ## Other similar selection helpers: starts_with(), contains(), matches() 
## # A tibble: 87 × 4
##    name               hair_color    skin_color  eye_color
##    <chr>              <chr>         <chr>       <chr>    
##  1 Luke Skywalker     blond         fair        blue     
##  2 C-3PO              <NA>          gold        yellow   
##  3 R2-D2              <NA>          white, blue red      
##  4 Darth Vader        none          white       yellow   
##  5 Leia Organa        brown         light       brown    
##  6 Owen Lars          brown, grey   light       blue     
##  7 Beru Whitesun Lars brown         light       blue     
##  8 R5-D4              <NA>          white, red  red      
##  9 Biggs Darklighter  black         light       brown    
## 10 Obi-Wan Kenobi     auburn, white fair        blue-gray
## # ℹ 77 more rows
## select() keeps the columns whose variable names end with "color"

starwars %>% 
   mutate(bmi = mass / ((height / 100) ^ 2)) %>%
    # create (or modify) variables in the dataset
    # here we keep the variable "name" as is,
    # and create a new variable "bmi"
    # bmi = weight (kg) / (height (m))^2
  select(name:mass, bmi)
## # A tibble: 87 × 4
##    name               height  mass   bmi
##    <chr>               <int> <dbl> <dbl>
##  1 Luke Skywalker        172    77  26.0
##  2 C-3PO                 167    75  26.9
##  3 R2-D2                  96    32  34.7
##  4 Darth Vader           202   136  33.3
##  5 Leia Organa           150    49  21.8
##  6 Owen Lars             178   120  37.9
##  7 Beru Whitesun Lars    165    75  27.5
##  8 R5-D4                  97    32  34.0
##  9 Biggs Darklighter     183    84  25.1
## 10 Obi-Wan Kenobi        182    77  23.2
## # ℹ 77 more rows
    # choose only certain columns to keep
    # here: from "name" through "mass" (inclusive), plus the new "bmi"
starwars %>% 
  arrange(desc(mass))
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Jabba D…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
##  2 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
##  3 IG-88       200   140 none       metal      red             15   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Bossk       190   113 none       green      red             53   male  mascu…
##  8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
##  9 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
## 10 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
# reorder the rows of the dataset
    # by default, arrange() sorts in ascending order
    # here we wrap mass with desc() to sort in descending order

starwars %>%
group_by(species) %>%
  # group the data by species so calculations are done per species
  summarise(
    # create summary statistics for each species group
    n = n(),                        # number of characters in that species
    mass = mean(mass, na.rm = TRUE), # average mass (ignoring missing values)
    median_height = median(height, na.rm = TRUE), # median height (ignoring missing values)
    max_height = max(height, na.rm = TRUE), # max height (ignoring missing values)
    min_height = min(height, na.rm = TRUE) # min height (ignoring missing values)
  ) %>%
  filter(
    # keep only the species that meet BOTH conditions:
    n > 1,      # at least 2 characters in the dataset
    mass > 50   # average mass greater than 50
  )
## # A tibble: 9 × 6
##   species      n  mass median_height max_height min_height
##   <chr>    <int> <dbl>         <dbl>      <int>      <int>
## 1 Droid        6  69.8            97        200         96
## 2 Gungan       3  74             206        224        196
## 3 Human       35  81.3           181        202        150
## 4 Kaminoan     2  88             221        229        213
## 5 Mirialan     2  53.1           168        170        166
## 6 Twi'lek      2  55             179        180        178
## 7 Wookiee      2 124             231        234        228
## 8 Zabrak       2  80             173        175        171
## 9 <NA>         4  81             179        185        157
#cheatsheet for dplyr: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf

T-test in R

The t.test() function in R is used to perform a t-test, which is a statistical test used to determine if there is a significant difference between the means of two groups.
The t-test can be used for:

  • Independent samples: two separate groups.
  • Paired samples: the same group measured at two different times.

Arguments

  • x: A numeric vector containing your data values for a one-sample t-test, or the first sample for a two-sample t-test.
  • y: (Optional) A numeric vector containing the second sample for a two-sample t-test. If y is not provided, a one-sample t-test is performed.
  • alternative: Specifies the alternative hypothesis. Options are "two.sided" (default), "less", or "greater".
  • mu: The hypothesized mean (for a one-sample t-test) or the hypothesized difference in means (for a two-sample t-test). Default is 0.
  • paired: Logical. TRUE for a paired t-test, FALSE (default) for an independent two-sample t-test.
  • var.equal: Logical. TRUE assumes equal variances for the two samples (Student’s t-test). FALSE (default) uses Welch’s t-test.
  • conf.level: Confidence level for the confidence interval of the mean difference. Default is 0.95.
  • formula: An alternative way to specify the test using a formula, e.g., value ~ group, where value is numeric and group is a factor with two levels. Often used with the data argument.
  • data: The data frame containing the variables specified in the formula.

#one-sample
data_sample <- c(10, 12, 11, 13, 10, 14, 12, 11)
t.test(data_sample, mu = 10) # Test if mean is significantly different from 10
## 
##  One Sample t-test
## 
## data:  data_sample
## t = 3.2646, df = 7, p-value = 0.01378
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
##  10.44798 12.80202
## sample estimates:
## mean of x 
##    11.625
#Two-Sample Independent T-test:
group1 <- c(25, 28, 30, 27, 26)
group2 <- c(22, 24, 23, 21, 25)
t.test(group1, group2) # Assuming unequal variances (default)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = 3.7717, df = 7.7111, p-value = 0.005831
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.615302 6.784698
## sample estimates:
## mean of x mean of y 
##      27.2      23.0
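The paired = TRUE option described above is worth a quick demonstration. With made-up before/after measurements on the same subjects (hypothetical values), a paired t-test is equivalent to a one-sample t-test on the within-subject differences:

```r
# hypothetical before/after measurements on the same 6 subjects
before <- c(200, 190, 210, 205, 198, 220)
after  <- c(195, 186, 202, 204, 195, 213)

# paired t-test: is the mean within-subject difference 0?
t.test(before, after, paired = TRUE)

# equivalent one-sample t-test on the differences
t.test(before - after, mu = 0)
```

The two calls return identical t statistics and p-values.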

Regression

  • formula: A symbolic description of the model to be fitted, typically in the form
    response ~ terms
  • response: The dependent variable.
  • terms: The independent variables or predictors, separated by +.
  • Example: y ~ x1 + x2 fits a model where y is predicted by x1 and x2. To fit a model without an intercept, specify y ~ x1 + x2 - 1 or y ~ x1 + x2 + 0; otherwise the intercept is included by default.
  • data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.
  • na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.
  • method: the fitting method to be used; for lm() the default is "qr" (least squares via QR decomposition).
lm_model1 <- lm(spendeduc ~ dem_score14, data = world) # fit a linear model predicting spending on education based on democracy score: spendeduc = intercept + slope*dem_score14
summary(lm_model1) # get a summary of the model
## 
## Call:
## lm(formula = spendeduc ~ dem_score14, data = world)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5267 -1.2725 -0.1958  0.7895  9.6860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.99632    0.43287   6.922 1.28e-10 ***
## dem_score14  0.26071    0.07088   3.678 0.000329 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.854 on 147 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.08428,    Adjusted R-squared:  0.07805 
## F-statistic: 13.53 on 1 and 147 DF,  p-value: 0.0003288
ggplot(world, aes(x = dem_score14, y = spendeduc))+
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(title = "Scatter plot of Education Spending vs Democracy Score",
       x = "Democracy Score (2014)",
       y = "Education Spending (% of GDP)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).

Fitting the line manually

# standardization
standard_units <- function(x) {
  (x - mean(x)) / sd(x)
}

# calculate the correlation
correlation <- function(t, x, y) {
  one_over_n_minus1 <- 1 / (length(t[[x]]) - 1)
  x_part <- standard_units(t[[x]])
  y_part <- standard_units(t[[y]])
  one_over_n_minus1 * sum(x_part * y_part)
}

#compute the slope and intercept
slope <- function(df, x, y) {
  df = df %>% filter(!is.na(df[[x]]), !is.na(df[[y]])) # remove NA values
  r <- correlation(df, x, y)
  r * sd(df[[y]]) / sd(df[[x]])
}

intercept <- function(df, x, y) {
  df = df %>% filter(!is.na(df[[x]]), !is.na(df[[y]])) # remove NA values
  mean(df[[y]]) - slope(df, x, y) * mean(df[[x]])
}

model_slope <- slope(world, 'dem_score14', 'spendeduc')
model_slope
## [1] 0.2607091
model_intercept <- intercept(world, 'dem_score14', 'spendeduc')
model_intercept
## [1] 2.996325
r2 <- function(df, x, y) {
  df = df %>% filter(!is.na(df[[x]]), !is.na(df[[y]])) # remove NA values
  correlation(df, x, y)^2
}

r2(world, 'dem_score14', 'spendeduc')
## [1] 0.08427923

Output of lm()

  • summary(model): Provides a detailed summary of the model, including coefficients, R-squared, F-statistic, etc.
  • plot(model): Generates diagnostic plots for assessing model assumptions (e.g., residuals vs. fitted, Q-Q plot).
  • coefficients(model) or coef(model): Extracts the estimated regression coefficients.
  • residuals(model): Extracts the residuals of the model.
  • fitted.values(model) or predict(model): Extracts the predicted values from the model.
summary(lm_model1)
## 
## Call:
## lm(formula = spendeduc ~ dem_score14, data = world)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5267 -1.2725 -0.1958  0.7895  9.6860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.99632    0.43287   6.922 1.28e-10 ***
## dem_score14  0.26071    0.07088   3.678 0.000329 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.854 on 147 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.08428,    Adjusted R-squared:  0.07805 
## F-statistic: 13.53 on 1 and 147 DF,  p-value: 0.0003288
plot(lm_model1)

coefficients(lm_model1)
## (Intercept) dem_score14 
##   2.9963249   0.2607091
residuals(lm_model1)
##           2           3           4           5           6           7 
## -1.57454532  0.30515937 -1.26970028  0.12042507 -1.07305335 -0.64531361 
##           8           9          10          11          12          13 
##  0.17721966 -1.83413156 -0.84455992 -2.10322331  1.24165864  1.03625219 
##          14          15          16          18          19          20 
## -0.86933114  0.83402194  1.79416960  3.05189473  0.27964218 -0.65089693 
##          21          22          23          24          25          26 
##  0.53737501 -2.49148756  3.33551390 -2.64251425 -0.98534282 -0.46356324 
##          27          28          29          30          31          32 
##  0.66753728 -2.08478141 -1.48738850 -1.62985563 -1.87845210 -0.80396930 
##          33          35          36          37          38          39 
##  3.68597918 -1.94977411 -0.08981872  0.68337209 -0.90303874  9.68597918 
##          40          41          42          43          44          45 
##  2.17442799 -0.46635490  2.52861549  4.92415499 -2.53525439 -3.52668713 
##          46          47          48          49          50          51 
## -0.02016555 -1.09875512 -2.82910195 -1.63245503 -0.01421309  1.53383737 
##          52          53          54          55          56          57 
##  1.74109723  0.54947221  0.50757419 -0.17659100 -1.79148756 -1.61365168 
##          58          59          60          61          62          63 
## -0.84885125  0.75338670 -0.93860746 -1.31104459 -2.08105919  1.70050660 
##          64          67          68          69          70          71 
##  1.56288451 -1.38050548  0.60478253  2.00608223 -1.86114072 -1.30825292 
##          72          74          75          76          77          78 
##  1.28747115 -0.36970798  1.41446491 -0.74289109  1.27703509 -1.70285417 
##          79          80          81          83          84          85 
##  0.92340900 -1.02277265  2.66623758 -0.89763999 -0.18180518  2.23755958 
##          86          87          88          89          90          91 
## -1.27249194  0.05357127 -2.33115533  7.66735271 -1.58683479 -1.28701936 
##          92          93          94          95          96          97 
## -0.26207127 -1.61142143 -1.12575658 -1.24865898 -0.27193823 -0.18832675 
##          98          99         100         101         102         103 
## -0.70583040 -0.38367398  0.31651829 -1.52631799  0.06213852  3.55599379 
##         104         106         107         108         109         110 
##  0.37778107  1.66083883  0.78877084  1.87685051 -0.43990715  0.17815021 
##         111         112         113         115         116         117 
##  0.78950913 -1.28329714 -0.34437535  1.11483405  0.18244154 -1.30601498 
##         119         121         122         123         124         125 
## -1.04214510 -0.62836367 -2.00136221 -2.16132529 -0.04382164  0.27275146 
##         126         127         128         129         130         131 
## -0.52537974 -0.33786148  0.01987136  0.25637063  2.22918460  0.50031433 
##         132         133         134         135         136         137 
## -0.24568275 -0.38515825 -1.76840058 -1.31253655  0.23010745  0.06493019 
##         138         142         143         144         145         147 
## -0.69503290  4.09808408  1.16697587 -0.06617033  1.45004132 -0.11420539 
##         148         149         150         151         152         153 
##  2.29938378  0.49845322  2.21614145 -0.19577118 -0.61868129  2.55860088 
##         154         156         157         158         159         160 
## -1.43115533 -0.55722624  0.89063195 -2.78459684  0.43718274  0.38932456 
##         161         163         164         165         166 
## -2.32631799 -0.61811988  1.41465718  1.47629680 -3.26225585
fitted.values(lm_model1) # or predict(lm_model1)
##        2        3        4        5        6        7        8        9 
## 4.474545 3.994841 3.869700 4.779575 4.073053 5.345314 5.222780 3.734132 
##       10       11       12       13       14       15       16       18 
## 3.744560 4.503223 3.958341 5.063748 4.469331 4.265978 4.505830 5.048105 
##       19       20       21       22       23       24       25       26 
## 4.920358 4.750897 4.062625 3.791488 3.864486 4.242514 3.885343 5.363563 
##       27       28       29       30       31       32       33       35 
## 5.032463 3.384781 3.387389 5.029856 3.778452 4.703969 3.914021 3.749774 
##       36       37       38       39       40       41       42       43 
## 5.089819 3.916628 4.803039 3.914021 4.925572 5.066355 5.371385 3.775845 
##       44       45       46       47       48       49       50       51 
## 4.735254 4.526687 3.820166 4.698755 3.429102 3.632455 5.014213 3.966163 
##       52       53       54       55       56       57       58       59 
## 4.458903 5.350528 5.092426 3.976591 3.791488 4.513652 5.248851 4.646613 
##       60       61       62       63       64       67       68       69 
## 4.938607 4.511045 3.781059 3.499493 4.537115 4.680505 4.795217 5.493918 
##       70       71       72       74       75       76       77       78 
## 5.061141 4.808253 3.512529 5.269708 4.985535 5.042891 4.922965 5.102854 
##       79       80       81       83       84       85       86       87 
## 3.976591 3.822773 4.333762 5.097640 3.981805 4.362440 3.572492 4.946429 
##       88       89       90       91       92       93       94       95 
## 4.331155 4.732647 4.286835 3.987019 4.962071 5.311421 4.625757 4.148659 
##       96       97       98       99      100      101      102      103 
## 4.471938 4.688327 4.505830 5.183674 4.083482 5.126318 4.737861 4.644006 
##      104      106      107      108      109      110      111      112 
## 4.722219 4.039161 4.211229 4.623149 4.239907 5.321850 5.410491 4.383297 
##      113      115      116      117      119      121      122      123 
## 4.044375 5.585166 3.817558 4.206015 4.842145 4.628364 4.701362 4.761325 
##      124      125      126      127      128      129      130      131 
## 4.943822 5.027249 3.825380 4.737861 3.880129 3.843629 3.470815 4.599686 
##      132      133      134      135      136      137      138      142 
## 4.745683 4.185158 4.568401 4.912537 4.969893 5.035070 5.095033 3.801916 
##      143      144      145      147      148      149      150      151 
## 5.533024 5.366170 3.449959 3.614205 4.500616 4.401547 4.883859 3.895771 
##      152      153      154      156      157      158      159      160 
## 4.818681 4.641399 4.331155 4.357226 4.409368 3.684597 5.162817 5.110675 
##      161      163      164      165      166 
## 5.126318 4.318120 3.885343 3.723703 4.662256

Bootstrapping

bootstrap_slope <- function(dframe, x, y, repetitions) {
  # Create an empty numeric vector to store the bootstrapped slope estimates
  slopes <- numeric(repetitions)
  
  # Run the bootstrap procedure for the specified number of repetitions
  for (i in seq_len(repetitions)) {
    # Step 1: Take a bootstrap sample (resample rows with replacement)
    bootstrap_sample <- dframe[sample(nrow(dframe), replace = TRUE), ]
    
    # Step 2: Fit a linear regression model of y ~ x using the bootstrap sample
    model <- lm(as.formula(paste(y, "~", x)), data = bootstrap_sample)
    
    # Step 3: Extract the slope coefficient (the second coefficient in the model)
    slopes[i] <- coef(model)[[2]]
  }
  
  # Step 4: Calculate the 95% confidence interval from the bootstrap distribution
  left <- quantile(slopes, 0.025)   # 2.5th percentile
  right <- quantile(slopes, 0.975)  # 97.5th percentile
  
  # Step 5: Fit the regression on the original dataset to get the observed slope
  observed_model <- lm(as.formula(paste(y, "~", x)), data = dframe)
  observed_slope <- coef(observed_model)[[2]]
  
  # Step 6: Plot the histogram of all bootstrapped slope estimates
  hist(slopes, col = "lightblue", border = "black", 
       main = "Bootstrap Distribution of Slopes",
       xlab = "Slope Estimates")
  
  # Step 7: Add a thick yellow line at the bottom showing the confidence interval
  lines(c(left, right), c(0, 0), col = "yellow", lwd = 12)
  
  # Step 8: Print the observed slope and bootstrap confidence interval
  cat("Slope of regression line:", observed_slope, "\n")
  cat("Approximate 95%-confidence interval for the true slope:\n")
  cat(left, right, "\n")
  
  # Step 6-7 in ggplot2
  slope_df <- data.frame(slopes = slopes)
  ggplot(slope_df, aes(x = slopes)) +
    geom_histogram(fill = "lightblue", color = "black") +
    geom_vline(xintercept = observed_slope, color = "red", linetype = "dashed") +
    geom_segment(aes(x = left, xend = right, y = 0, yend = 0), 
                 color = "yellow", size = 4) +
    labs(title = "Bootstrap Distribution of Slopes",
         x = "Slope Estimates", y = "Count") +
    theme_minimal()
}

set.seed(123)
bootstrap_slope(world, "dem_score14", "spendeduc", 1000)

## Slope of regression line: 0.2607091 
## Approximate 95%-confidence interval for the true slope:
## 0.1219065 0.3983563
## Warning in geom_segment(aes(x = left, xend = right, y = 0, yend = 0), color = "yellow", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Confidence interval for predictions

bootstrap_prediction <- function(dframe, x, y, new_x, repetitions) {
  predictions <- numeric(repetitions)
  
  # Run bootstrap loop
  for (i in seq_len(repetitions)) {
    # Step 1: Resample with replacement
    bootstrap_sample <- dframe[sample(nrow(dframe), replace = TRUE), ]
    
    # Step 2: Fit linear model on bootstrap sample
    model <- lm(as.formula(paste(y, "~", x)), data = bootstrap_sample)
    
    # Step 3: Predict at new_x
    newdata <- data.frame(setNames(list(new_x), x))
    predictions[i] <- predict(model, newdata = newdata)
  }
  
  # Step 4: Compute 95% interval
  left <- quantile(predictions, 0.025)
  right <- quantile(predictions, 0.975)
  
  # Step 5: Prediction on original sample
  original_model <- lm(as.formula(paste(y, "~", x)), data = dframe)
  original <- predict(original_model, newdata = data.frame(setNames(list(new_x), x)))
  
  
  # Step 6: Plot histogram of predictions
  # hist(predictions, col = "lightblue", border = "black",
  #      main = paste("Bootstrap Predictions at x =", new_x),
  #      xlab = paste("Predicted y at x =", new_x))
  # 
  # # Draw yellow confidence interval line at bottom
  # lines(c(left, right), c(0, 0), col = "yellow", lwd = 12)
  
  # Step 6: Plot histogram with ggplot
  pred_df <- data.frame(predictions = predictions)
  p <- ggplot(pred_df, aes(x = predictions)) +
    geom_histogram(fill = "lightblue", color = "black") +
    geom_vline(xintercept = original, color = "red", linetype = "dashed") +
    geom_segment(aes(x = left, xend = right, y = 0, yend = 0),
                 color = "yellow", size = 4) +
    labs(title = paste("Bootstrap Predictions at", x, "=", new_x),
         x = paste("Predicted", y, "at", x, "=", new_x),
         y = "Count") +
    theme_minimal()
  
  print(p)   # <--- this forces display inside function
  
  # Step 7: Print results
  cat("Prediction at", x, "=", new_x, ":", original, "\n")
  cat("Approximate 95%-confidence interval:\n")
  cat(left, right, "\n")
}
set.seed(123)
bootstrap_prediction(world, "dem_score14", "spendeduc", new_x = mean(world$dem_score14), repetitions = 1000)
## Warning in geom_segment(aes(x = left, xend = right, y = 0, yend = 0), color = "yellow", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Prediction at dem_score14 = 5.564699 : 4.447092 
## Approximate 95%-confidence interval:
## 4.154091 4.754798
bootstrap_prediction(world, "dem_score14", "spendeduc", new_x = max(world$dem_score14), repetitions = 1000)
## Warning in geom_segment(aes(x = left, xend = right, y = 0, yend = 0), color = "yellow", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## Prediction at dem_score14 = 9.93 : 5.585166 
## Approximate 95%-confidence interval:
## 5.015232 6.093974

Prediction

new_data <- data.frame(dem_score14 = c(0, 5, 10, 15, 20))
predictions <- predict(lm_model1, newdata = new_data, interval = "confidence", level = 0.95)
predictions
##        fit      lwr       upr
## 1 2.996325 2.140874  3.851776
## 2 4.299870 3.983230  4.616510
## 3 5.603416 4.932775  6.274056
## 4 6.906961 5.572686  8.241236
## 5 8.210506 6.187673 10.233340
# Create a grid of values for dem_score14
dem_grid <- data.frame(dem_score14 = seq(min(world$dem_score14, na.rm = TRUE),
                                         max(world$dem_score14, na.rm = TRUE),
                                         length.out = 200))

# Get fitted values and 95% confidence interval
preds <- predict(lm_model1, newdata = dem_grid, interval = "confidence", level = 0.95)

# Combine with grid
plot_data <- cbind(dem_grid, preds)

# Plot
ggplot(world, aes(x = dem_score14, y = spendeduc)) +
  geom_point(color = "gray40", alpha = 0.7) +  # scatterplot of observed data
  geom_line(data = plot_data, 
            aes(x = dem_score14, y = fit), 
            color = "red", size = 1.2, inherit.aes = FALSE) + # regression line
  geom_ribbon(data = plot_data, 
              aes(x = dem_score14, ymin = lwr, ymax = upr), 
              fill = "blue", alpha = 0.2, inherit.aes = FALSE) +
  labs(x = "Democracy Score (2014)",
       y = "Education Spending",
       title = "95% Confidence Interval for Mean Prediction") +
  theme_minimal()
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).

lm_model2 <- lm(durable ~ dem_score14 , data = world) 
summary(lm_model2)
## 
## Call:
## lm(formula = durable ~ dem_score14, data = world)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.179 -18.047  -7.893  10.205 153.006 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -8.848      6.659  -1.329    0.186    
## dem_score14    5.776      1.122   5.148 8.44e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.8 on 145 degrees of freedom
##   (19 observations deleted due to missingness)
## Multiple R-squared:  0.1545, Adjusted R-squared:  0.1487 
## F-statistic:  26.5 on 1 and 145 DF,  p-value: 8.435e-07

Interpreting Regression Results: Regime Durability

Model 2: durable ~ dem_score14

  • Democracy score (dem_score14)
    • Estimate = 5.78, p < 0.001
    • More democratic countries have regimes that last about 6 years longer on average.
  • Intercept
    • Not significant → do not interpret substantively.
  • Model fit
    • R² = 0.15 → democracy explains about 15% of variation in regime durability.
#adding control variables
lm_model3 <- lm(durable ~ dem_score14 + gdpcap3_08, data = world) 
summary(lm_model3)
## 
## Call:
## lm(formula = durable ~ dem_score14 + gdpcap3_08, data = world)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.780 -15.515  -5.071   6.800 145.666 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     14.040      9.975   1.407 0.161532    
## dem_score14      3.859      1.295   2.979 0.003414 ** 
## gdpcap3_08Low  -22.192      6.509  -3.410 0.000853 ***
## gdpcap3_08Mid  -16.261      6.203  -2.622 0.009732 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.46 on 138 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.2486, Adjusted R-squared:  0.2323 
## F-statistic: 15.22 on 3 and 138 DF,  p-value: 1.305e-08

Model 3: durable ~ dem_score14 + gdpcap3_08

  • Democracy score (dem_score14)
    • Estimate = 3.86, p = 0.003
    • Effect remains positive but is smaller once GDP is included.
  • GDP per capita (gdpcap3_08)
    • Low GDP: regimes last ~22 years less than high-GDP regimes (p < 0.001).
    • Mid GDP: regimes last ~16 years less than high-GDP regimes (p < 0.01).
  • Model fit
    • R² = 0.25 → adding GDP improves the model and reduces error.

Key Takeaways

  • Democracy → longer-lasting regimes, even after controlling for development.
  • Economic development strongly predicts durability: richer countries’ regimes are much more stable.
  • Together, democracy and GDP explain about 25% of cross-national variation in regime durability.

A few things to be careful about when running a regression

  1. Missing data: If your dataset contains missing values (NA), R will automatically exclude those rows from the regression (listwise deletion). Check the “observations deleted due to missingness” note in summary() so you know how many cases were dropped.
  2. Categorical variables: If you include categorical variables (factors) in your regression model, R will automatically create dummy variables for them. Make sure to interpret the coefficients correctly.
  3. Multicollinearity: If your independent variables are highly correlated with each other, it can lead to unstable estimates of coefficients. You might want to check for multicollinearity using variance inflation factors (VIF) or correlation matrices. Example: gdp and log(gdp)
  4. Outliers: Outliers can have a significant impact on regression results. Consider checking for outliers and deciding how to handle them (e.g., removing them, transforming variables).
  5. Model assumptions: Linear regression relies on several assumptions (linearity, independence, homoscedasticity, normality of residuals). It’s important to check these assumptions using diagnostic plots and tests.
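To illustrate point 3, VIFs can be computed without extra packages by regressing each predictor on the others (a sketch; vif_manual and the simulated preds data frame are hypothetical, not part of the lab code):

```r
# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(2)
x <- rnorm(100)
preds <- data.frame(x = x, x_noisy = x + rnorm(100, sd = 0.1), z = rnorm(100))
vif_manual(preds)  # x and x_noisy show large VIFs; z stays near 1
```

A common rule of thumb is that VIFs above roughly 5–10 signal problematic multicollinearity.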

Lab 6: K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a supervised machine learning model that can be used for both regression and classification tasks. The algorithm is non-parametric, which means that it doesn’t make any assumption about the underlying distribution of the data.

The KNN algorithm predicts the labels of the test dataset by looking at the labels of its closest neighbors in the feature space of the training dataset. The “K” is the most important hyperparameter that can be tuned to optimize the performance of the model.

KNN is a simple and intuitive algorithm that provides good results for a wide range of classification problems. It is easy to implement and understand, and it applies to both small and large datasets. However, it comes with some drawbacks too, and the main disadvantage is that it can be computationally expensive for large datasets or high-dimensional feature spaces.

The KNN algorithm is used in e-commerce recommendation engines, image recognition, fraud detection, text classification, anomaly detection, and many more. In this tutorial, we will be using the KNN algorithm for a loan approval system.

How does K-Nearest Neighbors Classification work?

The KNN classification algorithm works by finding the K neighbors (closest data points) to a new data point in the training dataset. Then, it assigns the label of the majority class among those neighbors to the new data point.

Let’s break down the algorithm into multiple parts.

First, the algorithm calculates the distance between the new data point and all the data points in the training set, and selects the K closest points. The metric used for calculating distance can vary depending on the problem; the most commonly used is the Euclidean distance.

After identifying the K closest neighbors, the algorithm assigns the label of the majority class among those neighbors to the new data point. For example, with K = 3, if two of the neighbors are labeled “blue” and one is labeled “red”, the algorithm assigns “blue” to the new data point.

Steps

  1. We will choose the value of K, which is the number of nearest neighbors that will be used to make the prediction.
  2. Calculate the distance between that point and all the points in the training set.
  3. Select the K nearest neighbors based on the distances calculated.
  4. Assign the label of the majority class to the new data point.
  5. Repeat steps 2 to 4 for all the data points in the test set.
  6. Evaluate the accuracy of the algorithm.

The value of “K” is provided by the user, and it can be used to optimize the algorithm’s performance. Smaller K values can lead to overfitting, and larger values can lead to underfitting. So, it is crucial to find optimal values that provide stability and the best fit.
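A sketch of tuning K on simulated data, assuming the class package (which ships with R and provides the built-in knn() function) is available; the simulated X, y, and candidate K values are illustrative:

```r
library(class)

set.seed(123)
X <- matrix(rnorm(400), ncol = 2)  # 200 points, 2 features
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(200, sd = 0.5) > 0, "blue", "red"))
train_idx <- 1:150                 # first 150 rows as training set

# Compute test-set accuracy for several candidate values of K
accuracy <- sapply(c(1, 3, 5, 7, 9, 11), function(k) {
  pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
              cl = y[train_idx], k = k)
  mean(pred == y[-train_idx])
})
accuracy  # pick the K that balances overfitting (small K) and underfitting (large K)
```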

data <- read.csv("data/loan_data.csv")
data <- subset(data, select = -c(purpose))
head(data,3)
##   credit.policy int.rate installment log.annual.inc   dti fico
## 1             1   0.1189      829.10       11.35041 19.48  737
## 2             1   0.1071      228.22       11.08214 14.29  707
## 3             1   0.1357      366.86       10.37349 11.63  682
##   days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec
## 1          5639.958     28854       52.1              0           0       0
## 2          2760.000     33623       76.7              0           0       0
## 3          4710.000      3511       25.6              1           0       0
##   not.fully.paid
## 1              0
## 2              0
## 3              0

Train and Test Split

We can split the dataset manually, but using dplyr’s slice_sample() and anti_join() is much cleaner.

# Set a random seed for reproducibility — ensures the same random split each time
set.seed(123)

# Randomly sample 75% of the data to create the training set
# You can adjust the proportion as needed. Sometimes, we use 90/10 or 80/20 splits.
train <- data %>% slice_sample(prop = 0.75)

# Create the test set by taking all rows not included in the training set
# `anti_join()` keeps rows from `data` that do NOT match rows in `train`
# The `by = colnames(data)` argument ensures comparison across all columns
test  <- anti_join(data, train, by = colnames(data))

Feature Scaling / Standardization

We will now scale both the training and the testing set. On the back end, the scale() function computes (x - mean(x)) / sd(x) for each column. We scale only the features, removing the target labels from both the training and testing sets.
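A quick sanity check (a sketch on a toy vector) that scale() matches the manual standardization formula:

```r
v <- c(2, 4, 6, 8)
all.equal(as.numeric(scale(v)), (v - mean(v)) / sd(v))  # TRUE
```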

# how to standardize manually
standard_units <- function(x) {
  (x - mean(x)) / sd(x)
}
train_scaled <- scale(train[-13])             # scale numeric features (drop the target, column 13)
train_scaled_full <- train_scaled %>%         # take the scaled feature matrix and
  cbind(train['not.fully.paid'])              # reattach the target label
test_scaled <- scale(test[-13])

Manual Implementation of KNN

Before we get into the theoretical details of how this might work, we will define a series of functions to help us do it in practice. We will explain them in more detail below, but for now we first provide a brief description of each function followed by the actual functions.

  1. distance calculates the Euclidean distance between two points (these could be rows of data).
  2. all_distances calculates the distance between a row and all other rows in the data.
  3. table_with_distances returns the training data, but with a column showing the distance from each row of that data to some example we are interested in.
  4. closest returns the top-k (so the k closest) rows to our data point of interest.
  5. majority does a majority vote among the top-k closest rows to give us a prediction; in this case, the prediction will be either zero or one.
  6. classify uses the function above to tell us the classification of a given example (an instance).
### --- Manual KNN Implementation --- ###

############################################################
#              Manual KNN Classification Code              #
#   This script implements a simple K-Nearest Neighbors    #
#   algorithm without using the built-in `class::knn()`    #
#   function. Each step is explained in detail below.      #
############################################################

#-----------------------------------------------------------
# Function: distance()
# Purpose:  Compute the Euclidean distance between two numeric vectors.
# Inputs:   point1, point2 — numeric vectors of equal length.
# Output:   A single numeric value representing their distance.
# Formula:  sqrt(sum((x_i - y_i)^2))
#-----------------------------------------------------------
distance <- function(point1, point2) {
  sqrt(sum((point1 - point2)^2))
}


#-----------------------------------------------------------
# Function: all_distances()
# Purpose:  Compute distances from one test observation (point)
#           to all rows in the training dataset.
# Steps:
#   1. Remove the target label ('not.fully.paid') from training data.
#   2. Compute Euclidean distance between the test point and each training row.
# Inputs:  training — data frame with predictors and a target column.
#          point — numeric vector representing one test observation.
# Output:  A numeric vector of distances (length = number of training rows).
#-----------------------------------------------------------
all_distances <- function(training, point) {
  # Keep only predictor columns
  attributes <- training[, !(names(training) %in% "not.fully.paid")]
  
  # Compute distances row-wise
  distance_from_point <- function(row) {
    distance(point, as.numeric(row))
  }
  
  apply(attributes, 1, distance_from_point)
}


#-----------------------------------------------------------
# Function: table_with_distances()
# Purpose:  Create a copy of the training data that includes a new
#           column 'Distance' representing each observation's
#           distance from the test point.
# Inputs:   training — full training dataset.
#           point — test observation.
# Output:   A data frame identical to training but with an extra
#           'Distance' column.
#-----------------------------------------------------------
table_with_distances <- function(training, point) {
  train_copy <- training
  ad <- all_distances(train_copy, point)
  train_copy$Distance <- ad
  return(train_copy)
}


#-----------------------------------------------------------
# Function: closest()
# Purpose:  Identify the k nearest neighbors to a test point.
# Steps:
#   1. Compute distances for all training points.
#   2. Sort the training data by distance.
#   3. Return the top k closest observations.
# Inputs:   training — training dataset with target.
#           point — single test observation.
#           k — number of neighbors to consider.
# Output:   A data frame containing the k nearest neighbors.
#-----------------------------------------------------------
closest <- function(training, point, k) {
  with_dists <- table_with_distances(training, point)
  sorted_by_distance <- with_dists[order(with_dists$Distance), ]
  topk <- head(sorted_by_distance, k)
  return(topk)
}


#-----------------------------------------------------------
# Function: majority()
# Purpose:  Determine the most frequent class among the k nearest neighbors.
# Inputs:   topkclasses — data frame containing the k nearest neighbors,
#            including their 'not.fully.paid' class labels.
# Output:   Predicted class (1 or 0) based on majority vote.
# Note:     If the number of ones equals zeros, defaults to 0.
#-----------------------------------------------------------
majority <- function(topkclasses) {
  ones  <- topkclasses[topkclasses$not.fully.paid == 1, ]
  zeros <- topkclasses[topkclasses$not.fully.paid == 0, ]
  
  if (nrow(ones) > nrow(zeros)) {
    return(1)
  } else {
    return(0)
  }
}


#-----------------------------------------------------------
# Function: classify()
# Purpose:  Classify a single test observation using KNN logic.
# Steps:
#   1. Identify k nearest neighbors.
#   2. Use majority voting to determine the predicted class.
# Inputs:   training — training dataset with class labels.
#           p — single test observation (row vector without label).
#           k — number of neighbors to consider.
# Output:   Predicted class label (0 or 1).
#-----------------------------------------------------------
classify <- function(training, p, k) {
  topkclasses <- closest(training, p, k)
  return(majority(topkclasses))
}
# Apply manual KNN classification to all test observations
# (test_scaled already excludes the target column, so we pass it as-is)
test_pred_manual <- apply(test_scaled, 1, function(row) classify(train_scaled_full, row, k = 3))
print(test_pred_manual)   # predicted labels for all test points
##    [1] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
##   [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
## ... (remaining output omitted: 2,395 predicted labels in total, the vast majority 0)
# Convert to numeric vector
test_pred_manual <- as.numeric(test_pred_manual)

# Evaluate accuracy
accuracy_manual <- mean(test_pred_manual == test$not.fully.paid)
print(accuracy_manual)    # overall classification accuracy
## [1] 0.8025052

Training KNN Classifier and Predicting

The class library is a popular choice for training KNN classifiers: it is simple and fast. We supply the scaled training and test datasets, the true class labels, and the hyperparameter k.

library(class)
# Perform K-Nearest Neighbors (KNN) classification
test_pred <- knn(
  train = train_scaled,          # specify which are the training data - the standardized one
  test  = test_scaled,           # specify which are the testing data - the standardized one
  cl    = train$not.fully.paid,  # specify the true class labels from the training set
  k     = 3                      # number of nearest neighbors to consider 
                                 # this is the main parameter to tune
)

Model Evaluation

To evaluate the model, we will display a confusion matrix using the table() function. We pass the actual labels (the test target) and the predicted labels to table(), and, as we can see, the results are quite good for the majority class.

The KNN algorithm does not deal well with imbalanced data, which is why we see poor performance on the minority class.

actual <- test$not.fully.paid
cm <- table(actual,test_pred)
cm
##       test_pred
## actual    0    1
##      0 1887  134
##      1  339   35
accuracy <- sum(diag(cm))/length(actual) # accuracy: share of correctly predicted cases (the diagonal of the confusion matrix). For binary outcomes, also think about false positives and false negatives!
sprintf("Accuracy: %.2f%%", accuracy*100)
## [1] "Accuracy: 80.25%"
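Accuracy alone hides the imbalance problem just described. As a quick sketch, we can recompute per-class rates from the counts in the confusion matrix above (the numbers below are copied from that printed output):

```r
# rows are actual classes, columns are predictions (counts from the output above)
cm_mat <- matrix(c(1887, 339, 134, 35), nrow = 2,
                 dimnames = list(actual = c("0", "1"),
                                 predicted = c("0", "1")))
accuracy    <- sum(diag(cm_mat)) / sum(cm_mat)       # (TN + TP) / total
sensitivity <- cm_mat["1", "1"] / sum(cm_mat["1", ]) # recall on defaults (class 1)
specificity <- cm_mat["0", "0"] / sum(cm_mat["0", ]) # accuracy on non-defaults
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Accuracy is about 0.80, but sensitivity (the rate at which actual defaults are caught) is under 10%, while specificity is above 93%: exactly the majority-class bias described above.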
k_values <- unique(c(seq(1, 21, by = 2), seq(21, 100, by = 20)))  # odd k values avoid ties in the vote; unique() drops the duplicated 21
accuracy <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  k <- k_values[i]
  pred <- knn(
    train = train_scaled,
    test = test_scaled,
    cl = train$not.fully.paid,
    k = k
  )
  accuracy[i] <- mean(pred == test$not.fully.paid)
}

# ---- Create a data frame and plot ----
acc_df <- data.frame(k = k_values, accuracy = accuracy)

# Identify the row with the maximum accuracy
max_point <- acc_df[which.max(acc_df$accuracy), ]

# Plot accuracy vs. K and highlight the maximum point in red
ggplot(acc_df, aes(x = k, y = accuracy)) +
  geom_line() +                                    # line connecting accuracy values
  geom_point() +                                   # regular points
  geom_point(data = max_point, color = "red", size = 1) +  # highlight max accuracy in red
  ylim(0.7, 1) +
  labs(
    title = "KNN Accuracy across Different K Values",
    x = "Number of Neighbors (K)",
    y = "Accuracy"
  ) +
  theme_minimal(base_size = 14)

Advantages and Disadvantages of using KNN

Advantages

  • It is a simple algorithm to understand and implement.
  • It is versatile and can be used for both regression and classification tasks.
  • It provides interpretable results that can be visualized and understood, since the predicted class is based on the labels of the nearest neighbors in the training data.
  • KNN does not make assumptions about the decision boundary between classes, and this feature allows it to capture nonlinear relationships between features.
  • The algorithm does not make assumptions about the distribution of the data, which makes it suitable for a wide range of problems.
  • KNN does not build an explicit model: it stores the training data and uses it directly at prediction time.

Disadvantages

  • It is computationally and memory expensive for large and complex datasets.
  • KNN performance drops on imbalanced data: it is biased toward the majority class, which can result in poor performance on minority classes.
  • It is not well suited to noisy data, since the nearest neighbors of a data point may not be representative of its true class label.
  • It is not suitable for high-dimensional data, as high dimensionality can cause the distances between all data points to become similar.
  • Finding the optimal number of neighbors k can be time-consuming.
  • KNN is sensitive to outliers, as it chooses neighbors based on a distance metric.
  • It does not handle missing values in the training dataset well.

Lab 7: Multiple Linear Regression

library(caTools)
library(FNN) #KNN regression
## 
## Attaching package: 'FNN'
## The following objects are masked from 'package:class':
## 
##     knn, knn.cv

Until now, we have been interested in predicting categories for observations. But it will often be the case that we want to predict continuous numerical outcomes like incomes, prices or temperatures.

Previously, we discussed regression in the context of fitting models of this type:

\[ Y = a + bX \]

where \(Y\) was the outcome, \(a\) was our intercept or constant, and \(b\) was our slope coefficient. In that case, we had one predictor or \(X\) variable. We extend that now to multiple predictors:

\[ Y = a +b_1X_1 + b_2X_2 + \ldots + b_kX_k \]

Here, \(X_1\) is our first variable, \(X_2\) is a second, all the way through to \(X_k\) (our \(k\)th variable). Notice that each variable has its own slope coefficient: its own \(b\). Note also the terminology: we sometimes refer to the \(X\) variables as the “right hand side”, because they are on the right hand side of the equals sign. For the obvious related reason, we sometimes refer to the dependent variable, the \(Y\) variable, as the “left hand side”.

Now we have a multiple regression which involves one \(Y\) (still), but many \(X\)s. As a side-note, this is not a “multivariate” regression, which describes something else. In any case, a given estimated \(b\) tells us

the (average) effect on \(Y\) of a one unit change in that \(X\), holding the other variables constant

Implicitly, this is allowing us to compare like with like: when we talk of the effect of, say, size of the first floor on the sales price (\(Y\)), it is as if we are comparing houses which have the same size second floor, the same size garage and so on. We are then asking “what would happen to these similar houses (in terms of their price, on average) if we altered this one factor?”

sales <- read.csv("data/house.csv") %>%
  subset(Bldg.Type == "1Fam") %>%
  dplyr::select(SalePrice, X1st.Flr.SF, X2nd.Flr.SF,
                Total.Bsmt.SF, Garage.Area, Wood.Deck.SF,
                Open.Porch.SF, Lot.Area,
                Year.Built, Yr.Sold) %>% 
  na.omit()

# generate train and test sets
set.seed(123)
split <- sample.split(sales$SalePrice, SplitRatio = 0.7)
train <- subset(sales, split == TRUE)
test  <- subset(sales, split == FALSE)

Fitting to the training data

training_model <- lm(SalePrice ~ X1st.Flr.SF + X2nd.Flr.SF + Total.Bsmt.SF +
               Garage.Area + Wood.Deck.SF + Open.Porch.SF +
               Lot.Area + Year.Built + Yr.Sold, data= train)
summary(training_model)
## 
## Call:
## lm(formula = SalePrice ~ X1st.Flr.SF + X2nd.Flr.SF + Total.Bsmt.SF + 
##     Garage.Area + Wood.Deck.SF + Open.Porch.SF + Lot.Area + Year.Built + 
##     Yr.Sold, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -640607  -21418   -3490   16736  263072 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.637e+05  1.606e+06  -0.413    0.680    
## X1st.Flr.SF    7.886e+01  4.819e+00  16.365  < 2e-16 ***
## X2nd.Flr.SF    7.564e+01  2.665e+00  28.384  < 2e-16 ***
## Total.Bsmt.SF  3.955e+01  4.224e+00   9.364  < 2e-16 ***
## Garage.Area    6.588e+01  6.462e+00  10.194  < 2e-16 ***
## Wood.Deck.SF   3.665e+01  8.801e+00   4.165 3.26e-05 ***
## Open.Porch.SF -1.895e+01  1.653e+01  -1.146    0.252    
## Lot.Area       8.020e-02  1.505e-01   0.533    0.594    
## Year.Built     6.723e+02  4.140e+01  16.239  < 2e-16 ***
## Yr.Sold       -3.337e+02  7.990e+02  -0.418    0.676    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44360 on 1819 degrees of freedom
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.7461 
## F-statistic:   598 on 9 and 1819 DF,  p-value: < 2.2e-16

Each element of Coefficients in the table is a coefficient. These are the slope estimates. So, for example, the \(b\) on Garage.Area is 6.588e+01 (that means 6.588 \(\times 10^1\), which is 65.88). This tells us that, holding all other variables constant, a one square foot increase in the garage area of a house is expected to increase the sale price by $65.88 (on average).
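To unpack the scientific notation, a one-line check (b_garage simply copies the printed estimate):

```r
# Garage.Area estimate copied from the regression output above
b_garage <- 6.588e+01
b_garage        # 65.88 dollars per extra square foot of garage, on average
100 * b_garage  # implied price difference for 100 extra square feet, other variables held constant
```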

The Pr(>|t|) gives the \(p\)-value for each coefficient. And we see there is also (bottom left) the \(R\)-squared (Multiple R-squared) should we want it.

But to reiterate, we are not particularly interested in interpreting the coefficients here: they are helpful to the extent they aid prediction.

Predicting on the test data

testX <- test[, !(names(test) == "SalePrice")]
y_new <- predict.lm(training_model, testX)

plot(y_new, test$SalePrice, col="black", ylab = "actual (test) data", xlab="fitted (predicted price)",
     xlim=c(0,500000), ylim=c(0,500000),
     pch=21, bg="lightblue", cex=1.3)
abline(0,1)

Using KNN

k <- 5
y_knn <- knn.reg(train = train[, !(names(train) == "SalePrice")],
                 test  = test[, !(names(test) == "SalePrice")],
                 y     = train$SalePrice,
                 k     = k)
plot(y_knn$pred, test$SalePrice, col="black", ylab = "actual (test) data", xlab="fitted (predicted price)",
     xlim=c(0,500000), ylim=c(0,500000),
     pch=21, bg="lightgreen", cex=1.3)
abline(0,1)

Which model performs better? RMSE

Root Mean Square Error (RMSE) is a popular way to compare models, and also to get an absolute sense of how well a given model fits the data. We define each term in turn:

  1. error is the difference between the actual \(Y\) and the predicted \(Y\): \(( \mbox{actual } Y- \hat{Y})\). Ideally we would like this to be zero: that would imply our predictions were exactly where they should be.
  2. square simply means we square the error. We do this for reasons we met above in other circumstances: if we don’t square the error, negative and positive ones will cancel out, making the model appear much better than it really is. This is \(( \mbox{actual } Y- \hat{Y}) ^2\).
  3. mean is the average. We want to know how far off we are from the truth, on average. This is \(\frac{1}{n}\sum ( \mbox{actual } Y- \hat{Y}) ^2\).
  4. root simply means we take the square root of the mean squared error. We do this to convert everything back to the underlying units of the \(Y\) variable.

Thus, RMSE is

\[ = \sqrt{\frac{1}{n}\sum ( \mbox{actual } Y- \hat{Y})^2} \]

Note that this quantity is sometimes known as the Root Mean Square Deviation, too.

Features of RMSE

Clearly, a lower RMSE is better than a higher one. The key moving part is the difference between the prediction for \(Y\) and the actual \(Y\). If that is zero, the model is doing as well as it possibly can.

The size of the RMSE is proportional to the squared error. That is, it is proportional to \(( \mbox{actual } Y- \hat{Y})^2\). This means that larger errors are disproportionately costly for RMSE. For example, from the perspective of RMSE it is (much) better to have three small errors of 2 each (meaning \(2^2+2^2+2^2=12\)) rather than, say, one error of 1, one of zero and one of 5 (meaning \(1^2+0+5^2=26\)). This is true even though the total difference between the actual and predicted \(Y\) is 6 units in both cases.
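The arithmetic in that example is easy to verify with a two-line helper (`rmse` here is a local sketch, not a library function):

```r
# RMSE prefers several small errors over one large one,
# even when the total absolute error (6 units) is the same in both cases
rmse <- function(e) sqrt(mean(e^2))
rmse(c(2, 2, 2))  # 2
rmse(c(1, 0, 5))  # about 2.94
```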

A related metric you may see in applied work is the Mean Absolute Error (MAE). This is \(\frac{1}{n}\sum | \mbox{actual } Y- \hat{Y} |\). The \(|\cdot|\) bars tell you to take the absolute value of what is inside them. Like the RMSE, MAE is on the same underlying scale as the data. Unlike the RMSE, however, the MAE does not punish bigger errors disproportionately more (though it does still punish them more).

Intuitively, the RMSE tells us how far off, on average, the model’s predictions (based on the attributes) were from the outcomes in the test set.

#RMSE for Multiple linear regression
MSE_regression <- mean((y_new - test$SalePrice)**2) 
RMSE_regression <- MSE_regression**(.5)

cat("\n the regression RMSE was:",
      RMSE_regression,"dollars")
## 
##  the regression RMSE was: 28870.4 dollars
rmse_nn <- mean((test$SalePrice - y_knn$pred) ** 2) ** 0.5
cat("\n RMSE of the NN model is:", rmse_nn, "dollars")
## 
##  RMSE of the NN model is: 33487.65 dollars

Interaction Terms

Suppose now that a new researcher thinks the effect of \(X\) is not simply constant and additive to the effect of \(Z\). In particular, she thinks the effect of \(X\) depends on the value of \(Z\), and the effect of \(Z\) depends on the value of \(X\). For example, she might think that for high literacy countries, increasing democraticness has a larger effect than for low literacy countries. So the effect is not constant.

Theoretically, this is a statement that we need an interaction term, which is a product of the two variables multiplied by each other, on which we place a new coefficient term \(\beta_3\)

\[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 (X \cdot Z) + \varepsilon \]

This model estimates four mean (average) values, one for each combination of \(X\) and \(Z\):

| \(X\) | \(Z\) | Expected \(Y\) | Interpretation |
|-------|-------|----------------|----------------|
| 0 | 0 | \(\beta_0\) | baseline: \(X\) and \(Z\) are zero |
| 1 | 0 | \(\beta_0 + \beta_1\) | average \(Y\) when \(X=1\) and \(Z=0\) |
| 0 | 1 | \(\beta_0 + \beta_2\) | average \(Y\) when \(Z=1\) and \(X=0\) |
| 1 | 1 | \(\beta_0 + \beta_1 + \beta_2 + \beta_3\) | average \(Y\) when \(X=1\) and \(Z=1\) |

In terms of how we interpret the coefficients, it is:

  • \(\beta_1\): effect of \(X\) when \(Z = 0\)
  • \(\beta_2\): effect of \(Z\) when \(X = 0\)
  • \(\beta_3\): extra effect of \(X\) when \(Z = 1\) (or vice versa)
world <- read.csv("data/world_data.csv")

mod1 <-lm(effectiveness ~ literacy + polity, data=world)
summary(mod1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 2.2821589 6.77394855 0.3369023 7.367256e-01
## literacy    0.4765802 0.08250627 5.7762907 5.210885e-08
## polity      1.1765599 0.26985790 4.3599239 2.596470e-05

The coefficients on both variables (holding the other one constant) suggest increasing either increases how effective we think governments will be. So far so good.

As a theoretical matter, the analyst has put together an additive model of this form:

\[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon \]

Here, \(X\) was literacy and \(Z\) was democraticness. An assumption underlying this model is that the effect of \(X\) is constant regardless of the value of \(Z\): specifically, that effect was 0.477. And vice versa: the effect of \(Z\) is constant regardless of the value of \(X\): specifically, that effect was 1.18. To be explicit, if the democraticness of the country was 10 (approximately what it is for Australia), the effect of increasing literacy by one unit (1 percentage point) was to increase effectiveness by 0.477. Similarly, if the democraticness of the country was -10 (approximately what it is for Qatar), the effect of increasing literacy by one unit (1 percentage point) was still to increase effectiveness by 0.477.

More technically, we say that we are estimating parallel shifts in the outcome across the levels of \(X\) and \(Z\). We can see this more easily if we make these variables into binary (that is, zero or one) versions of themselves: we’ll use being above or below 90 as the cut point for literacy, and being above or below 9 as the cut point for democracy.

literacy2 <-  ifelse(is.na(world$literacy), NA, ifelse(world$literacy >= 90, 1, 0))
polity2 <- ifelse(is.na(world$polity), NA, ifelse(world$polity >= 9, 1, 0))
mod2 <-lm(effectiveness ~ literacy2 + polity2, data=world)
summary(mod2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 32.984262   2.129727 15.487555 1.765117e-31
## literacy2    9.210441   3.459802  2.662130 8.729977e-03
## polity2     26.185754   3.710469  7.057263 8.663469e-11
#constant slopes and parallel lines

We can be more concrete with our specific case: we’re modeling effectiveness based on literacy (\(X = 1\) for high literacy countries) and democraticness (\(Z = 1\) for high democracy countries):

\[ \text{Effectiveness} = \beta_0 + \beta_1 \cdot \text{Literacy2} + \beta_2 \cdot \text{Polity2} + \beta_3 \cdot (\text{Literacy2} \times \text{Polity2}) + \varepsilon \]

  • \(\beta_1\): literacy gap among non-democracies
  • \(\beta_2\): democracy premium among low literacy countries
  • \(\beta_3\): how much more or less the democratic premium is for highly literate countries
mod3 <-lm(effectiveness ~ literacy2*polity2, data=world)
summary(mod3)$coef
##                    Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)       34.830575   2.135825 16.3077836 2.468138e-33
## literacy2          3.353170   3.804176  0.8814444 3.796917e-01
## polity2            7.722616   6.754072  1.1434014 2.549571e-01
## literacy2:polity2 25.705144   7.969351  3.2255002 1.588115e-03
mod4 <- lm(effectiveness ~ literacy2 + polity2 + literacy2*polity2, data=world) # equivalent to mod3: '*' already includes both main effects
summary(mod4)
## 
## Call:
## lm(formula = effectiveness ~ literacy2 + polity2 + literacy2 * 
##     polity2, data = world)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.151 -10.030  -0.926  10.845  61.816 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         34.831      2.136  16.308  < 2e-16 ***
## literacy2            3.353      3.804   0.881  0.37969    
## polity2              7.723      6.754   1.143  0.25496    
## literacy2:polity2   25.705      7.969   3.226  0.00159 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.95 on 131 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.4695, Adjusted R-squared:  0.4574 
## F-statistic: 38.65 on 3 and 131 DF,  p-value: < 2.2e-16

The model output (coefficients) can be interpreted as follows:

  • (Intercept): average effectiveness for low-literacy non-democracies (\(34.83\)).
  • Literacy2: difference in effectiveness when moving from low to high literacy among non-democracies (\(3.35\)).
  • Polity2: difference in effectiveness between non-democracies and democracies among low-literacy countries (\(7.72\)).
  • Literacy2:Polity2: extra effect of higher literacy among democracies (\(25.71\)).

In terms of our substantive interpretation of the interaction term:

  • If positive, high-literacy countries gain an additional boost from being democracies (as opposed to non-democracies).
  • If positive, democracies gain an additional boost from being high-literacy.
  • If negative, high-literacy countries experience a reduction in effectiveness from being democracies.
  • If negative, democracies experience a reduction in effectiveness from being high-literacy (as opposed to low-literacy).

In terms of predicted averages, we have:

| Literacy | Democracy | Expected Effectiveness | Interpretation | Example |
|----------|-----------|------------------------|----------------|---------|
| 0 | 0 | \(34.83\) | baseline: low literacy, non-democracy | Angola |
| 1 | 0 | \(34.83 + 3.35 = 38.18\) | average effectiveness for a high-literacy non-democracy | Vietnam |
| 0 | 1 | \(34.83 + 7.72 = 42.55\) | average effectiveness for a low-literacy democracy | India |
| 1 | 1 | \(34.83 + 3.35 + 7.72 + 25.71 = 71.61\) | average effectiveness for a high-literacy democracy | Australia |
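The cell means can be reproduced directly from the estimated coefficients (a quick sketch; the four values below are copied, to three decimal places, from the summary(mod4) output above):

```r
b0 <- 34.831   # (Intercept)
b1 <- 3.353    # literacy2
b2 <- 7.723    # polity2
b3 <- 25.705   # literacy2:polity2

cells <- c(low_lit_nondem  = b0,
           high_lit_nondem = b0 + b1,
           low_lit_dem     = b0 + b2,
           high_lit_dem    = b0 + b1 + b2 + b3)
round(cells, 2)   # 34.83, 38.18, 42.55, 71.61
```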

Clearly, there is an extra “bump” in effectiveness for democracies that are high-literacy.
Indeed, examining the regression output shows that only the interaction effect itself is statistically significant, while the main effects of literacy and democracy are not (\(p > 0.05\)).

Taken literally, this implies that there is:

  • no general effect of democracy on effectiveness,
  • no general effect of literacy on effectiveness, and
  • a strong combined effect of being a high-literacy democracy.

Working with continuous predictors

The case with two binary (0/1) variables, \(X\) and \(Z\), was particularly easy to conceptualize. But interaction models can cope with both variables being continuous, or with one continuous and one binary, and so on. Indeed, interaction models can also handle triple (three-variable) or higher-order interactions. In these cases, as you might expect, interpretation requires more care.

But just to see how it might work, let’s look at our original model with a continuous interaction:

mod_cont <-lm(effectiveness ~ literacy*polity, data=world)
summary(mod_cont)$coef
##                    Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     23.60800848 8.40210161  2.809774 0.0057182269
## literacy         0.21202640 0.10312621  2.055989 0.0417695329
## polity          -4.58591530 1.48396697 -3.090308 0.0024426826
## literacy:polity  0.06688063 0.01696481  3.942316 0.0001307131

As before:

  • \(\hat{\beta}_1\) is the effect of literacy on effectiveness when polity is zero: 0.21.
  • \(\hat{\beta}_2\) is the effect of polity on effectiveness when literacy is zero: -4.59.
  • \(\hat{\beta}_3\) is how the effect of polity changes for a one-unit change in literacy, or, equivalently, how the effect of literacy changes for a one-unit change in polity.

What about the marginal effect of each variable? Well, holding polity fixed, the change in effectiveness with respect to literacy is

\[ \frac{\Delta \text{Effectiveness}}{\Delta \text{Literacy}} = \beta_1+ \beta_3\text{polity} \]

So, it depends on the specific level of polity. Similarly, holding literacy fixed the change in effectiveness with respect to polity is

\[ \frac{\Delta \text{Effectiveness}}{\Delta \text{Polity}} = \beta_2+ \beta_3\text{literacy} \]
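To make these formulas concrete, we can evaluate the marginal effect of literacy at a few polity values, plugging in the estimated coefficients (copied from the summary(mod_cont)$coef output above):

```r
b1 <- 0.21202640   # literacy coefficient
b3 <- 0.06688063   # literacy:polity coefficient

# marginal effect of literacy at a given polity score
me_literacy <- function(polity) b1 + b3 * polity

round(me_literacy(c(-10, 0, 10)), 3)  # negative for strong autocracies, positive for strong democracies
```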

Immediately then, we see that in order to talk about the slope of the line, we need to specify at what values (of the other variable) we are assessing it. Some natural choices might be the maximum, the minimum and the mean. For example, here is the effect of literacy on effectiveness, holding polity at these various values.

# Fit model
mod_cont <- lm(effectiveness ~ literacy * polity, data = world)

# Compute representative polity values
p_min  <- min(world$polity, na.rm = TRUE)
p_mean <- mean(world$polity, na.rm = TRUE)
p_max  <- max(world$polity, na.rm = TRUE)

# Plot the three fitted lines directly with geom_abline()
# (stat_smooth() cannot see 'polity' inside its formula argument,
#  so we construct each line from the model coefficients instead)
ggplot(world, aes(x = literacy, y = effectiveness)) +
  geom_point(alpha = 0.4) +
  geom_abline(intercept = coef(mod_cont)[1] + coef(mod_cont)[3]*p_min,
              slope     = coef(mod_cont)[2] + coef(mod_cont)[4]*p_min,
              color = "red", size = 1, linetype = "dashed") +
  geom_abline(intercept = coef(mod_cont)[1] + coef(mod_cont)[3]*p_mean,
              slope     = coef(mod_cont)[2] + coef(mod_cont)[4]*p_mean,
              color = "blue", size = 1) +
  geom_abline(intercept = coef(mod_cont)[1] + coef(mod_cont)[3]*p_max,
              slope     = coef(mod_cont)[2] + coef(mod_cont)[4]*p_max,
              color = "green", size = 1, linetype = "dotted") +
  labs(
    x = "Literacy",
    y = "Predicted Effectiveness",
    title = "Effect of Literacy at Different Levels of Democracy"
  ) +
  annotate("text", x = Inf, y = Inf, label = "Red = min, Blue = mean, Green = max",
           hjust = 1.1, vjust = 2, size = 3.5) +
  theme_minimal()
## Warning: Removed 21 rows containing missing values or values outside the scale
## range (`geom_point()`).

Notice that these lines have different slopes: the effect of increasing literacy is positive at the mean level of polity and at the maximum level. But it is negative at the minimum level of democracy: that is, predicted effectiveness declines as literacy increases for those cases.

Of course, how confident we are in these effects is partly a product of how much data we have at the various values of interest. With that in mind, it might make sense to plot 95% confidence intervals on our predictions. Here we do that in the case of the maximum value of democracy, specifically:

# Predict effectiveness across literacy, fixing polity at its max
p_max <- max(world$polity, na.rm = TRUE)
pred_data <- data.frame(
  literacy = seq(min(world$literacy, na.rm = TRUE),
                 max(world$literacy, na.rm = TRUE),
                 length.out = 100),
  polity = p_max
)

# Generate predictions with 95% confidence intervals
pred_data <- cbind(pred_data,
                   predict(mod_cont, pred_data, interval = "confidence"))

# Plot predicted line with 95% CIs
ggplot(pred_data, aes(x = literacy, y = fit)) +
  geom_line(color = "blue", size = 1) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, fill = "blue") +
  labs(
    x = "Literacy",
    y = "Predicted Effectiveness",
    title = "Predicted Effect of Literacy (Polity fixed at maximum)",
    subtitle = "Shaded area shows 95% confidence interval"
  ) +
  theme_minimal()

Lab 8: Instrumental Variables

library(AER) #package for instrumental variables ivreg()
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(rddtools) #package for regression discontinuity designs
## Loading required package: np
## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-18)
## [vignette("np_faq",package="np") provides answers to frequently asked questions]
## [vignette("np",package="np") an overview]
## [vignette("entropy_np",package="np") an overview of entropy-based methods]
## 
## Please consider citing R and rddtools,
## citation()
## citation('rddtools')

Example: Charter Schools and Success

A popular example of a tricky causal inference problem is the effect of attending a charter school: a type of school that is funded by the government but is mostly autonomous from local and state control. Are these schools more effective than the usual “public school” alternatives? We might see their students do better on, say, standardized mathematics tests, but it’s hard to know whether that’s because of the charter school or because the types of students who opt to go to a charter school are systematically different in some important way from those who don’t.

That is, we might like to estimate a regression of this form:

\[ \text{math scores}_i = \alpha + \beta\;\text{attend charter school}_i+ \epsilon_i \] where some people (some \(i\)) have the treatment (\(\text{attend charter school}_i=1\)) and some don’t. But it’s not hard to think of things (confounders) in the error term that might affect both attending a charter school and your math score: for example, a supportive, pro-education family environment.

“As if” random assignment: Compliers and Non-compliers

How can we get out of this impasse? In general, it would be great if we could randomly assign a treatment like smoking or democracy or going to a charter school to the units. But this may be difficult or impossible. The instrumental variable approach instead seeks a variable \(Z\) such that the treatment is “as if” random for units. By “as if” we mean that \(Z\) is unrelated (uncorrelated) to other pre-existing factors that we would like to control for (e.g. family environment).

One great example of such a \(Z\)—an “instrument”—is a lottery. For instance, a lottery to go on The Hajj pilgrimage or to serve in Vietnam. Suppose that there is, indeed, a lottery for going to a charter school.

Straight off the bat, we need to consider groups for whom the lottery does not affect treatment as intended:

  • Always-takers: \(T=1\) regardless of \(Z\) (e.g., attend the charter even when they “lose,” \(Z=0\)).
  • Never-takers: \(T=0\) regardless of \(Z\) (e.g., do not attend even when they “win,” \(Z=1\)).

Only compliers (\(T=Z\)) are shifted by the instrument; under monotonicity we rule out defiers (\(T=1-Z\)).

In other words, these groups are not “complying” with their lottery assignment, and we call them non-compliers. Because of them, the lottery result does not perfectly determine treatment status. In general, we cannot learn much about charter treatment effects from students who would always go to the charter school, or would never go, whether they won the lottery or not. And we will rule out people (“defiers”) who do the exact opposite of their treatment assignment (go if they lose, don’t go if they win a place).

But we can learn from people for whom the lottery made the difference between whether they went to the charter school or not (subject to some other assumptions). These are the compliers (they do what their treatment assignment tells them to do) and we will compare those (randomly) in treatment and those (randomly) in control in terms of their math scores.

An important final point here is that we don’t actually need \(Z\) to be a lottery, specifically. It can be anything unrelated to pre-existing relevant \(X\)-variables, that is, to the stuff that affects both treatment assignment and the outcome. A classic example textbooks give is trying to determine the effect of smoking on health (as opposed to health on smoking). Suppose that some towns introduce a tax on cigarettes, and some do not. So long as we don’t think taxes affect health directly, the tax \(Z\) is plausibly an instrument: that is, the tax affects health only by reducing how much people smoke. So we may be able to calculate the causal effect by comparing the health outcomes of the towns that implemented a tax with those that didn’t.

Assumptions

In keeping with the literature we will say:

  • \(Y\) is the outcome. We will have it be continuous—like a math score.
  • \(X\) (or \(T\)) is the treatment. We will have this be binary—like attending a charter school, or not.
  • \(Z\) is the instrumental variable (the instrument), and it is binary—like “winning” the lottery (being sufficiently high in the lottery) or not.

For an instrumental variables (IV) approach to work, we need to make the following assumptions:

  1. \(Z\) must have a causal effect on \(X\). This is the (strong) first stage requirement.
  2. \(Z\) must not be correlated with other factors we would like to control for. That is, \(Z\) must be uncorrelated with the error term in the original regression equation. We call this the independence assumption.
  3. \(Z\) must affect \(Y\) only via its effect on \(X\). That is, to the extent \(Z\) changes the outcome, it must not do so directly: it must operate via units’ allocation to the treatment (or not). This is the exclusion requirement.
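To make these assumptions concrete, here is a minimal simulation sketch (invented data, not the precept dataset; all names here are made up). A confounder u drives both treatment take-up and the outcome, the instrument z is randomized, and the true causal effect of x on y is 2:

```r
set.seed(346)
n <- 5000
u <- rnorm(n)                                  # unobserved confounder (e.g. family support)
z <- rbinom(n, 1, 0.5)                         # instrument: randomized, so independent of u
x <- as.numeric(0.8 * z + u + rnorm(n) > 0.5)  # treatment: driven by z AND by u
y <- 2 * x + u + rnorm(n)                      # true causal effect of x on y is 2

coef(lm(y ~ x))["x"]          # naive OLS: biased upward by the confounder
x_hat <- fitted(lm(x ~ z))    # first stage: the part of x explained by z
coef(lm(y ~ x_hat))["x_hat"]  # 2SLS estimate: close to 2 (up to sampling noise)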

Problems

In practice, it is very hard to find suitable instruments that satisfy all these requirements. Obvious violations include:

  • weak instruments where \(Z\) doesn’t affect \(X\) very much (or at all).

Suppose, for example, that the smoking tax (\(Z\)) just doesn’t affect people’s consumption of cigarettes (\(X\))—they are all addicted and don’t care about price. In that case, the instrument isn’t affecting treatment status, whether or not treatment causes a big effect on outcomes. Weak instruments can be very misleading: the estimates we get can be even more biased than for a naive linear regression.

There is generally no “solution” to this problem, though one can at least look at how correlated \(Z\) and \(X\) actually are.

  • invalid instruments (also “bad” instruments), where \(Z\) is correlated with other things we should be controlling for (violating independence) and/or affects the outcome directly (violating the exclusion restriction).

In the charter school case, it could be that the same parts of a city or county that use lotteries for charter schools are also richer. In the extreme, suppose that poorer parts of the county don’t use lotteries at all (and send no one to charter schools). But now kids who are being allocated to charter schools by lottery in the county are already richer and potentially more likely to do well. This will make the charter schools look more efficacious than they really are.

Again, there isn’t much in the way of a “solution” here either, although one can check how “balanced” (how equal, essentially) those who win (\(Z=1\)) vs lose (\(Z=0\)) the lottery are on pre-treatment covariates. If it looks like winners are already much richer, we may have a problem.
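In practice, such a balance check is one line per covariate. A sketch only: family_income and parent_education are hypothetical pre-treatment covariates in a hypothetical data frame my_data, not variables in our precept data.

```r
# Compare a pre-treatment covariate across lottery winners (Z = 1) and
# losers (Z = 0); a large, significant gap undermines independence.
# (family_income and my_data are hypothetical names.)
t.test(family_income ~ lottery, data = my_data)

# Or regress Z on several covariates at once: in a fair lottery,
# nothing pre-treatment should predict winning.
summary(lm(lottery ~ family_income + parent_education, data = my_data))
```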

We could obtain an exclusion restriction violation if doing well in the lottery affected math performance directly. For example, perhaps “winning” the lottery makes one feel more successful as a parent or student (even though it was random). That boost in confidence may help one do better on the math tests regardless of whether you go to a charter school.

There really isn’t much one can do to convince people the exclusion restriction holds. It mostly comes down to theory, though one can sometimes check whether \(Z\) affects other things it shouldn’t affect if the exclusion restriction truly holds.

Estimating IV

We’ll start this section with some data. In this dataset, we have

  • score which is math achievement
  • charter which is going to a charter school or not
  • lottery which is getting lotteried into charter school consideration, or not.

Let’s load it, and start with the naive regression of math scores on charter attendance:

schooling <- read.csv("data/IVschooling.csv")
naive_model <- lm(score ~ charter, data=schooling)
summary(naive_model)
## 
## Call:
## lm(formula = score ~ charter, data = schooling)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.76986 -0.68653 -0.00633  0.61983  3.02814 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.11664    0.04600  -2.536   0.0115 *  
## charter      3.48682    0.02275 153.277   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 498 degrees of freedom
## Multiple R-squared:  0.9792, Adjusted R-squared:  0.9792 
## F-statistic: 2.349e+04 on 1 and 498 DF,  p-value: < 2.2e-16

This effect is positive as we would expect, but we worry about confounding. In this particular case, we know there was a lottery, but we can imagine some non-compliance: for example, richer students found a way to go to the charter anyway—even though they were not lotteried in.

We can get a sense of the non-compliance, just by comparing charter and lottery:

cor(schooling$charter, schooling$lottery)
## [1] 0.1098809

This is less than one, so it looks like the lottery doesn’t perfectly predict treatment. We need to try to incorporate the IV more systematically.
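Before doing so, note that since both variables are binary, a cross-tabulation shows exactly where the non-compliance sits. A short sketch using the same schooling data:

```r
# Rows: lottery result (Z); columns: actual attendance (T).
# Under monotonicity, the (Z = 0, T = 1) cell contains always-takers
# and the (Z = 1, T = 0) cell contains never-takers.
table(lottery = schooling$lottery, charter = schooling$charter)
```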

Traditionally, we estimate IV models using two-stage least squares (2SLS). The two stages are:

First Stage

Isolate the part of \(X\) (the treatment) caused by the instrument (\(Z\)). This is what we want to have in the regression of the outcome on the treatment. In practice, we fit this regression:

\[ X= \gamma_0+ \gamma_1Z + v \]

This is exactly what it looks like: we are regressing \(X\) (our treatment), as if it’s an outcome, on our instrument \(Z\), as if that’s the treatment in this equation. Here the \(\gamma\)s are just coefficients, like \(\beta\)s. For every unit, we will now have a predicted value of \(X_i\), \(\hat{X}_i\). For our example:

stage1 <- lm(charter ~ lottery, data=schooling)
summary(stage1)$coef
##                Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -0.07129664  0.1234179 -0.5776848 0.56373805
## lottery      0.44412341  0.1800235  2.4670301 0.01395931
X_hat <- fitted(stage1)

This last line makes the predictions for \(X_i\).

Second Stage

Take those \(\hat{X}_i\)s and put them in the second stage equation:

\[ Y = \alpha + \beta \hat{X} + \epsilon \] Even though \(X\) is endogenous, \(\hat{\beta}\)—called the instrumental variables estimate—from this equation is a good (technically, “consistent”) estimate of the causal effect of \(X\) on \(Y\). For our example:

stage2 <- lm(score ~ X_hat, data=schooling)
summary(stage2)$coef
##               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -0.0309724  0.3732842 -0.0829727 0.93390659
## X_hat        2.8635073  1.4312270  2.0007360 0.04596316

This has reduced the coefficient size, but we are more confident we’ve captured the true causal effect because we are now relying only on the compliers: that is, the kids for whom the lottery made the difference. In fact, we can be more specific and say that the IV estimate reflects the treatment effect for compliers.

In practice one needs to do some corrections on the standard errors (the noise) to deal with the fact that the variable on the right hand side of the equation is estimated rather than fixed. But the principle is as we have done here.

Weak Instrument

The strength of an instrument is assessed by the first-stage F-statistic from the regression.

\[X = \gamma_0 + \gamma_1 Z + v.\]

A general rule of thumb (Stock and Yogo 2005) is:

  • F > 10: instrument is considered strong enough (weak-instrument bias is likely negligible).
  • F < 10: instrument is weak; 2SLS estimates may be biased toward OLS.

For multiple instruments, you can use the joint F-statistic testing that the coefficients on all excluded instruments are zero (the “first-stage F” reported in regression output or by summary(ivreg(), diagnostics = TRUE)).

summary(stage1)$fstatistic
##      value      numdf      dendf 
##   6.086237   1.000000 498.000000
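Here F ≈ 6.1, below the rule-of-thumb threshold of 10, so the lottery is a fairly weak instrument by this standard. With multiple instruments, the same check uses the joint F from the multi-instrument first stage; a sketch with two hypothetical instruments z1 and z2 (not in our dataset):

```r
# First stage with two instruments and no other exogenous controls:
# summary()'s overall F-statistic then tests H0 that both instrument
# coefficients are zero, i.e. that the instruments are jointly irrelevant.
stage1_multi <- lm(charter ~ z1 + z2, data = schooling)
summary(stage1_multi)$fstatistic
```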

Ivreg()

iv_model <- ivreg(score ~ charter | lottery, data = schooling)
summary(iv_model, diagnostics = TRUE)
## 
## Call:
## ivreg(formula = score ~ charter | lottery, data = schooling)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1509 -1.0007 -0.1125  1.0453  4.9280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03097    0.08550  -0.362    0.717    
## charter      2.86351    0.32783   8.735   <2e-16 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value   
## Weak instruments   1 498     6.086 0.01396 * 
## Wu-Hausman         1 497     9.329 0.00238 **
## Sargan             0  NA        NA      NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 498 degrees of freedom
## Multiple R-Squared: 0.948,   Adjusted R-squared: 0.9478 
## Wald test: 76.29 on 1 and 498 DF,  p-value: < 2.2e-16

Lab 9: Regression Discontinuity

In this part of the course, we are trying to find ways to “as if” randomize a treatment to units. Another way to put this is that we are trying to find a way to justify why a particular treatment assignment mechanism was as if random, such that we can make causal inferences. In the case of instrumental variables, we justified this by saying that treatment was assigned by some variable \(Z\) that was “as good as” randomly assigned and that only influences the outcome via its effect on the treatment. This helped us avoid the problem of baseline differences, but such variables are very hard to find.

Now we look at a situation where treatment is determined via a cutoff rule: if you are “over” the threshold, you get treatment; if you are under the threshold, you get control. The idea is that, for people close to the threshold, it was “as if” random whether they fell just over it or just under it, and thus those units are otherwise extremely similar. By “extremely similar” we mean they are (hopefully!) identical in terms of observed and unobserved characteristics, and thus we can avoid the baseline differences problem. A closely related idea is that we need to believe that, at the threshold, nothing else changes. That is, we need to believe that for those who just got treatment (as opposed to those who just missed it and got the control), the only thing different about their lives was being assigned treatment.

This approach will be called a Regression Discontinuity Design (RDD).

Background: the effects of a scholarship program

Suppose we are trying to estimate the effect of a special university scholarship program (treatment) on earnings after college. Perhaps the scholarship pays for some extra training, or networking, or generally just relieves financial burdens. Initially, we will suppose that the data we have is just all students’ outcomes (their salary) and whether they were in the special program or not. That regression looks like this:

scholars <- read.csv("data/scholarship_program.csv")
mod <-  lm(salary ~ scholarship, data=scholars)
summary(mod)$coef
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 199507.52   637.7081 312.85086  0.000000e+00
## scholarship  46485.44  1147.2094  40.52045 4.995335e-213

From this regression, it seems that getting the scholarship is worth an extra $46000 or so in salary. But it is plausible this is an overestimate: we can imagine that getting the scholarship is correlated with other things like being studious, or having family support, and that the true effect is a little lower. Supposing we had access to a confounder like family wealth, we might run the following regression:

mod2 <-  lm(salary ~ scholarship + family_wealth, data=scholars)
summary(mod2)$coef
##                   Estimate   Std. Error  t value      Pr(>|t|)
## (Intercept)   1.641367e+05 7.014239e+02 234.0050  0.000000e+00
## scholarship   2.629644e+04 6.661291e+02  39.4765 5.513084e-206
## family_wealth 3.621625e-02 6.431927e-04  56.3070 6.373797e-312

This seems to reduce the effect of the scholarship a fair amount (down to $26000 or so), but it still may be too large if there are other confounders. Put starkly: it seems likely that smart or diligent or well supported students would probably earn more money anyway, even in the absence of a scholarship program.

Set up and general idea

Now suppose that to get into the scholarship program, there’s a hard cut off exam. Specifically, if a student gets 90% or more on the entrance test, they are assigned to receive the scholarship. If a student gets below 90% on the entrance test, they are not assigned to the program.

We can be a little more formal and write:

\[ \text{scholarship program} = \begin{cases} 1 & \text{if } \text{exam score} \geq 90 \\ 0 & \text{if } \text{exam score} < 90 \end{cases} \]

Suppose, in addition, that the following features hold:

  1. treatment status is a deterministic consequence of the cut-off: if I know your score, I know your treatment. There is no probabilistic assignment here: we go from zero (no treatment, no scholarship) if you don’t hit the cut-off score, to getting treatment (scholarship) for sure if you do hit it. As an analyst, I don’t have to worry about who got the treatment: I know who did.

  2. treatment status is discontinuous at the cut-off: your treatment status “jumps” at the threshold (0 to 1), but not before. You don’t get some scholarship if you almost hit the cut-off. You don’t get more scholarship if you hit 91 rather than 90. You hit the threshold or you don’t; you get it, or you don’t.

To these, we add another assumption called “continuity at the threshold”. This is a formal way of saying nothing, other than treatment status, changes when we are (just) above or below the cutoff. Taken together, these assumptions allow us to compare people (just) above and below the threshold, which means comparing people who got treatment versus those who got control. And from here we can make causal claims.

We can be more formal about these ideas by rewriting the equation above in a more general way:

\[ D_i = \begin{cases} 1 & \text{if } R_i\geq c \\ 0 & \text{if } R_i < c \end{cases} \] Here, \(D_i\) is the treatment status for unit \(i\), which you can also write as \(T_i\) or \(X_i\). And then \(c\) is the common cutoff (it is not indexed by \(i\)). \(R_i\) is the unit’s score; it goes by several other names in this literature:

  • assignment variable: it decides your assignment to treatment or control
  • running variable: units have a value that runs along this variable, and that determines their treatment status
  • forcing variable: units have a value on this variable which forces them, deterministically, into treatment or control
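In code, this deterministic rule is a single comparison. A sketch for our scholarship example (using the scholars data frame loaded above):

```r
# Deterministic assignment: treatment status is a pure function of the
# running variable and the cutoff -- no randomness, no partial treatment.
c_cut <- 90
scholars$D <- as.numeric(scholars$score >= c_cut)

# In a sharp design like this one, D should reproduce the recorded
# scholarship indicator exactly.
all(scholars$D == scholars$scholarship)
```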

The Inferential Problem and “Solution”

As we hinted, the basic idea is that around the 90% cutoff (i.e. a little above it and a little below it), those students who did not get into the program are very similar to those that did in terms of the observed and unobserved features they have—like home environment, studiousness and so on. From their perspective, getting into the scholarship program was as good as random. In fact, we could say it’s “exogenous”: the threshold was set prior to the students taking the test, and it cannot be changed by them (more on this in a minute).

Homogeneous Slopes

All told then, the plan will be to compare the outcomes for those who just got it in, to those who only just didn’t make it, and draw causal conclusions. In order to make the causal comparison, in the simplest case, we need a regression of this form:

\[ \text{salary}_i = \alpha + \beta_1 \text{scholarship}_i + \beta_2\text{score}_i + \epsilon_i \] That is, a regression of the outcome on the treatment, but also the running variable (the exam score). Why do we need that latter piece? Well the idea is that we suspect there may still be a trend in which smarter/harder working students may get higher salaries, regardless of whether they were in the program or not. We’d want to check that there is some extra causal effect “bump” of the program over and above that effect.

Alternatively, we can dispense with what’s called the “smooth adjustment” for score if we just compare units very close to the cutoff with each other (above and below). In practice, this can be tricky because we perhaps do not have enough data (people) who are close enough to compare. So we do a controlled regression.

For our data, here is what the scatterplot looks like, with the threshold drawn on at the cutoff for the scholarship:

Let’s run the regression and draw lines on the plot for the fitted values it implies. We will also limit the data range a little to show the relationship in starker relief:

homog_mod <- lm(salary ~ scholarship + score, data= scholars)
summary(homog_mod)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -4096.702  2175.5028 -1.883106 5.997695e-02
## scholarship  7440.820   553.2680 13.448854 5.178174e-38
## score        2546.778    27.0933 94.000289 0.000000e+00
homog_fit <- fitted(homog_mod) # fitted values

Straight off the bat, it looks like being in the program (vs not) does indeed matter: it’s worth around $7000 extra of salary, averaged over the whole sample. This is the Average Treatment Effect. In purple, we plot the fitted values for each person:

Clearly, there’s some sort of “step up” in salary (\(y\) axis) as we go from not being in the program (\(<90\) score) to getting in. Note that under very stringent assumptions, the $7000 effect of the program is the size of the jump at the threshold. But generally those assumptions—like slope homogeneity—are not completely met, and we’ll deal with that below.

“Bandwidth”

As we mentioned above, an alternative idea to estimate the effect is just to limit ourselves to units very close to the cutoff, and compare means. We have to select what is known as a “bandwidth”—how close to the threshold we want to look—to do this. Let’s select a wide and a narrow one, and compare the results.

narrow_data <- scholars[which(89 < scholars$score & scholars$score < 91), ]
narrow_dif <- t.test(salary ~ scholarship, data = narrow_data)$estimate

Here, the implied difference (for students scoring between 89 and 91) is about $11000. For the large bandwidth, we have

wide_data <- scholars[which(60 < scholars$score & scholars$score < 100), ]
wide_dif <- t.test(salary ~ scholarship, data = wide_data)$estimate

Here, the implied difference (for students scoring between 60 and 100) is about $44000. These may both be too large (i.e. biased) for the reasons we discussed above. But clearly, when we take a wide window of points (the latter comparison) the bias is considerably worse because there is much more that varies between the students that we are not controlling for.
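The bandwidth choice clearly matters, so it is worth checking a range of them. A small sketch in base R, using the same difference-in-means idea (the bandwidth values here are arbitrary):

```r
# Difference in mean salary (treated minus control) for several bandwidths h;
# expect smaller windows to have less bias but fewer observations.
for (h in c(1, 2, 5, 10, 20)) {
  win <- scholars[abs(scholars$score - 90) < h, ]
  d <- mean(win$salary[win$scholarship == 1]) -
    mean(win$salary[win$scholarship == 0])
  cat("bandwidth", h, ": estimate", round(d), ", n =", nrow(win), "\n")
}
```

This bias-variance tradeoff is exactly why RDD practice involves sensitivity checks like the ones at the end of this lab.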

Heterogeneous Slopes

The situation above was especially simple because we assumed the slopes were homogeneous: that is, we assumed that the effect of the score was (linear and) constant (same slope) on either side of the cut-off. This seemed reasonable from the plot: there is a “jump” at 90, but the slope of the lines is similar on either side, with an obviously different implied intercept. Another way to see this is just to estimate the same regression on either side of the cutoff and compare the \(\hat{\beta}\)s:

mod_below <- lm(scholars$salary[scholars$score<90] ~ scholars$score[scholars$score<90])
summary(mod_below)$coef
##                                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                         -9052.549 2277.30297 -3.975118 7.776855e-05
## scholars$score[scholars$score < 90]  2608.768   28.37272 91.946334 0.000000e+00
mod_above <- lm(scholars$salary[scholars$score>90] ~ scholars$score[scholars$score>90])
summary(mod_above)$coef
##                                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                         60306.945 7783.63147  7.747919 1.374912e-13
## scholars$score[scholars$score > 90]  1948.911   81.64183 23.871478 6.133999e-72

These slopes—2608.8 and 1948.9—are not identical but they are not too far off.

Often though, we need to allow for (very) different slopes on either side of the cutoff. For now we will assume everything is still linear, but we could imagine a situation where, say, students who get higher scores above the cutoff get disproportionately higher salaries later in life. Perhaps being the “best of the best” sets them up for other awards or improves their subsequent peer group. Alternatively, the story might be that doing better doesn’t matter for those qualifying for the program, but does matter for those who miss out. For instance, the best performers among those who miss out might be referred by their professors to other scholarships or programs, and so get an extra boost from scoring higher. An example of such data is provided in the next snippet.

The key is that we need to add an interaction term to our regression model, as we did previously to allow for different slopes for different groups (here defined by treatment status):

\[ \text{salary}_i = \alpha + \beta_1 \text{scholarship}_i + \beta_2\text{score}_i + \beta_3(\text{scholarship}_i\times\text{score}_i) + \epsilon_i \] In R, we have

scholars2 <- read.csv("data/scholarship_program2.csv")
het_mod <- lm(salary ~ scholarship*score, data= scholars2)
summary(het_mod)$coef
##                     Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept)       -55351.477  2896.03309 -19.11286  1.291459e-69
## scholarship       118791.599 10522.94295  11.28882  6.788793e-28
## score               2025.928    35.93775  56.37327 3.652869e-312
## scholarship:score  -1036.282   111.66203  -9.28052  1.022718e-19

We can see the interaction is significant, so we know that the effect of the scholarship now depends on the value of the student’s exam score. A good way to get a sense of the effects is to think about what happens to (average) salary when you move from 80 to 89 (below the cutoff), versus from 91 to 100 (above it).

new_data <- data.frame(
  score = c(80, 89, 91, 100),
  scholarship = c(0, 0, 1, 1)
)

predicted_salary <- predict(het_mod, newdata = new_data)

# Combine and label for clarity
new_data$predicted_salary <- round(predicted_salary)
new_data$group <- ifelse(new_data$scholarship == 1, "Scholarship", "Missed Out")

# Print results
print(new_data)
##   score scholarship predicted_salary       group
## 1    80           0           106723  Missed Out
## 2    89           0           124956  Missed Out
## 3    91           1           153498 Scholarship
## 4   100           1           162405 Scholarship
# assign outcomes to refer to in-text
below_boost <- new_data[2,3] - new_data[1,3]
above_boost <- new_data[4,3] - new_data[3,3]

So increasing your score by 9 points gets you about $18,000 extra if you are below the cutoff, and (only) about $9,000 more above it. Let’s go back and plot the results in score-salary space:

And let’s just check the slopes in each part of the graph:

mod_below2 <- lm(scholars2$salary[scholars2$score<90] ~ scholars2$score[scholars2$score<90])
paste("slope below is", summary(mod_below2)$coef[2,1] )
## [1] "slope below is 2025.92834586977"
mod_above2 <- lm(scholars2$salary[scholars2$score>90] ~ scholars2$score[scholars2$score>90])
paste("slope above is", summary(mod_above2)$coef[2,1] )
## [1] "slope above is 989.646675551644"

As demonstrated by the regression, the effect of a higher score below the cutoff is larger than above it: the slope is clearly steeper. As a presentation matter, note that sometimes authors run the same regression but with centered data. Here, you subtract the cutoff (90) from the running variable (the score) and then run the regression. It becomes:

scholars2$score_centered <- scholars2$score - 90 # center
het_mod_center <- lm(salary ~ scholarship*score_centered, data= scholars2)
summary(het_mod_center)$coef
##                              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)                126982.074  422.42173 300.60498  0.000000e+00
## scholarship                 25526.249  807.90051  31.59578 2.458821e-152
## score_centered               2025.928   35.93775  56.37327 3.652869e-312
## scholarship:score_centered  -1036.282  111.66203  -9.28052  1.022718e-19

This tells us the scholarship is worth an extra $25,526.25; this is the estimated discontinuity when one goes from just below to just above the threshold.
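As a quick check on this reading, we can predict salary at the cutoff itself (score_centered = 0) with and without the scholarship; since the interaction term drops out there, the difference is exactly the scholarship coefficient:

```r
# Predicted salary at the cutoff, without (0) and with (1) the scholarship;
# the difference is the estimated jump at the threshold.
nd <- data.frame(score_centered = 0, scholarship = c(0, 1))
diff(predict(het_mod_center, newdata = nd))
```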

Local Average Treatment Effect (LATE)

Note that, because the slopes are heterogeneous, we conclude that the treatment effect varies with score. But if that is true, then what we are estimating is a Local Average Treatment Effect (LATE). It’s the average effect of the treatment at the threshold: that is, for those specific students who were close to the cutoff (on either side). We are thinking about what would happen if those students who just got treatment had instead got control, and vice versa.

Thus, this LATE is not necessarily telling us what the causal effect of winning the scholarship would have been for someone far from the cutoff, or generally not in close contention. The treatment effect for those students may be larger, smaller, or even be negative (they could be worse off). Formally, we say the causal effect for students not near the threshold is not identified by this design.

RDD package

# Same slope (homogeneous)
rdd_data(y = scholars$salary,
         x = scholars$score,
         cutpoint = 90) %>%
  rdd_reg_lm(slope = "same") %>%
  summary()
## 
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17465.5  -3471.1    -41.2   3609.5  17409.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 225113.31     339.82  662.45   <2e-16 ***
## D             7440.82     553.27   13.45   <2e-16 ***
## x             2546.78      27.09   94.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5341 on 997 degrees of freedom
## Multiple R-squared:  0.9617, Adjusted R-squared:  0.9616 
## F-statistic: 1.251e+04 on 2 and 997 DF,  p-value: < 2.2e-16
# Different slopes (heterogeneous)
rdd_data(y = scholars$salary,
         x = scholars$score,
         cutpoint = 90) %>%
  rdd_reg_lm(slope = "separate") %>%
  summary()
## 
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17272.3  -3472.0    -15.1   3447.9  17803.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 225736.57     342.13 659.804  < 2e-16 ***
## D             9972.38     641.50  15.545  < 2e-16 ***
## x             2608.77      27.75  94.019  < 2e-16 ***
## x_right       -659.86      90.53  -7.289 6.35e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5206 on 996 degrees of freedom
## Multiple R-squared:  0.9636, Adjusted R-squared:  0.9635 
## F-statistic:  8791 on 3 and 996 DF,  p-value: < 2.2e-16

Sensitivity Analysis

lm_sensitivity <- scholars %>%
  filter(score >= 89 & score <= 91) %>%  # narrow window around the cutoff
  mutate(threshold = ifelse(score >= 90, 1, 0)) %>%
  lm(salary ~ threshold + I(score - 90) + threshold:I(score - 90), data = .)

summary(lm_sensitivity)
## 
## Call:
## lm(formula = salary ~ threshold + I(score - 90) + threshold:I(score - 
##     90), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13437.7  -3454.3    685.1   3361.4  11078.2 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               226707       1431 158.474  < 2e-16 ***
## threshold                   7245       1975   3.668 0.000442 ***
## I(score - 90)               3429       2521   1.360 0.177657    
## threshold:I(score - 90)      944       3577   0.264 0.792552    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4721 on 79 degrees of freedom
## Multiple R-squared:  0.5935, Adjusted R-squared:  0.5781 
## F-statistic: 38.45 on 3 and 79 DF,  p-value: 2.003e-15
scholars %>%
  filter(score >= 85 & score <= 95) %>%      # zoom around the cutoff
  select(score, salary) %>%
  mutate(threshold = as.factor(ifelse(score >= 90, 1, 0))) %>%
  ggplot(aes(x = score, y = salary, color = threshold)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Accent") +
  guides(color = "none") +
  geom_vline(xintercept = 90, color = "red",
             linewidth = 1, linetype = "dashed") +
  labs(y = "Salary (post-graduation)",
       x = "Exam Score (running variable)")
## `geom_smooth()` using formula = 'y ~ x'

Lab 10: Diff-in-Differences

To recap, we are trying to find ways to “as if” randomize a treatment to units. In the case of Regression Discontinuity, the idea was that there was an objective, non-manipulable threshold: if you met that cutoff you got the treatment, and if you didn’t you were in the control condition. The key assumption was that units very close to the threshold were very similar in every other relevant way except their treatment status. So, in order to estimate the treatment effect, we simply had to compare those two groups.

To recap the Card and Krueger minimum wage study: New Jersey raised its minimum wage (treated), Pennsylvania did not (control). The idea is then to:

  • Measure the outcome in both before treatment (but only one is treated!)
  • Measure the outcome in both after treatment (remember only one is treated!)

and then compare how each state changed relative to its baseline value of the outcome. This “difference-in-differences” tells us the effect of treatment. The actual data looked like this:

             NJ       PA      DiD
Feb 1992   20.44    23.33
Nov 1992   21.03    21.17
Change     +0.59    -2.16    2.75

At bottom right you can see the “difference-in-differences” was 2.75. This is calculated as:

\[ \begin{align} &= (\text{NJ in Nov} - \text{NJ in Feb}) - (\text{PA in Nov} - \text{PA in Feb}) \\ &= (21.03 - 20.44) - (21.17 - 23.33) \\ &= 0.59 - (-2.16) \\ &= 2.75 \end{align} \]

So, the effect of a higher minimum wage on employment was \(+2.75\).
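The same arithmetic can be written out in R, using the numbers from the table above:

```r
# Difference-in-differences computed from the Card & Krueger numbers
nj_feb <- 20.44; nj_nov <- 21.03   # New Jersey (treated)
pa_feb <- 23.33; pa_nov <- 21.17   # Pennsylvania (control)
did <- (nj_nov - nj_feb) - (pa_nov - pa_feb)
round(did, 2)  # 2.75
```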

nj = read.csv("data/data-difference-in-differences.csv")

# Prepare data for fit2
nj2 <- nj %>%
  dplyr::select(
    y_ft_employment_after,
    y_ft_employment_before,
    d_nj,
    x_burgerking,
    x_kfc,
    x_roys,
    x_co_owned,
    x_st_wage_before,
    x_st_wage_after,
    x_closed_permanently,
    x_southern_nj,
    x_central_nj,
    x_northeast_philadelphia,
    x_easton_philadelphia
  ) %>%
  mutate(
    x_st_wage_after = case_when(
      x_closed_permanently == 1 ~ NA_real_,
      TRUE ~ as.numeric(x_st_wage_after)
    )
  ) %>%
  na.omit()

# Manual diff regression
# =========================
# Regression 1: Manual DiD (first-difference)
# Outcome: Δ employment = after - before
# =========================
fit2 <- lm(
  (y_ft_employment_after - y_ft_employment_before) ~
    d_nj + x_burgerking + x_kfc + x_roys + x_co_owned,
  data = nj2
)

summary(fit2)
## 
## Call:
## lm(formula = (y_ft_employment_after - y_ft_employment_before) ~ 
##     d_nj + x_burgerking + x_kfc + x_roys + x_co_owned, data = nj2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.050  -3.685   0.584   4.077  27.169 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -2.2067     1.6082  -1.372   0.1709  
## d_nj           2.2815     1.1970   1.906   0.0575 .
## x_burgerking   0.7566     1.4911   0.507   0.6122  
## x_kfc          0.9912     1.6750   0.592   0.5544  
## x_roys        -1.3280     1.6811  -0.790   0.4301  
## x_co_owned     0.3729     1.0988   0.339   0.7345  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.721 on 345 degrees of freedom
## Multiple R-squared:  0.02038,    Adjusted R-squared:  0.006185 
## F-statistic: 1.436 on 5 and 345 DF,  p-value: 0.2108
# Build long data for standard DiD
nj_long <- nj2 %>%
  mutate(id = row_number()) %>%
  tidyr::pivot_longer(
    cols = c(y_ft_employment_before, y_ft_employment_after),
    names_to = "period",
    values_to = "y"
  ) %>%
  mutate(
    post = ifelse(period == "y_ft_employment_after", 1, 0)
  )

# Standard DiD regression
# =========================
# Regression 2: Standard DiD
# Outcome: y_it
# Treatment: d_nj
# Post: post
# ATT = coefficient(d_nj:post)
# =========================
fit_did <- lm(
  y ~ d_nj * post +
    x_burgerking + x_kfc + x_roys + x_co_owned,
  data = nj_long
)

summary(fit_did)
## 
## Call:
## lm(formula = y ~ d_nj * post + x_burgerking + x_kfc + x_roys + 
##     x_co_owned, data = nj_long)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.111  -5.013  -1.226   3.146  63.646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.77120    1.30249  19.018  < 2e-16 ***
## d_nj         -2.27878    1.13764  -2.003   0.0456 *  
## post         -1.87879    1.44621  -1.299   0.1943    
## x_burgerking  1.61837    1.00446   1.611   0.1076    
## x_kfc        -9.25074    1.12834  -8.199 1.18e-15 ***
## x_roys       -0.09739    1.13247  -0.086   0.9315    
## x_co_owned   -1.04091    0.74018  -1.406   0.1601    
## d_nj:post     2.27686    1.60495   1.419   0.1565    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.308 on 694 degrees of freedom
## Multiple R-squared:  0.2184, Adjusted R-squared:  0.2105 
## F-statistic:  27.7 on 7 and 694 DF,  p-value: < 2.2e-16

Why does this work as a causal estimate? The idea—the assumption—is that whatever covariate or confounder differences there are between Pennsylvania and New Jersey, those systematic differences are not changing over time. If this is true, then any “extra” difference in November—over and above the difference in February—must be due only to the fact that New Jersey has been treated. In which case, if we subtract off the previous difference from the current difference, what’s left is the causal effect of treatment.

Note that, in principle, we could account for all relevant (non-time-varying) covariates with explicit controls. But that’s probably very hard to do in practice. So DiD is a good choice in that sense.

Graphically, it requires a set up like this:

[Figure: Implied DiD design from Card & Krueger]

A nice feature of DiD designs is that they control for anything relevant that differs between the units but that does not vary over time. For example, in the case of New Jersey and Pennsylvania, their relative populations don’t vary much over the time period of this study: New Jersey had about 7.8 million people, Pennsylvania had about 12 million. This design controls for any effect on employment that this difference had.

What Could Go Wrong?

By contrast, DiD does not control for things that do vary over time. So we need to be concerned about any variable, other than the minimum wage change, that potentially caused employment in New Jersey and Pennsylvania to evolve differently over time. Suppose, for instance, that in response to New Jersey increasing the minimum wage, Pennsylvania cuts taxes on tips. Or suppose that New Jersey implemented the increase in the minimum wage because its economy was (relative to other local ones) experiencing an upswing. These things would probably mean that, even in the absence of treatment, employment in New Jersey and Pennsylvania would not have followed similar trends over time.

What can we do about this? One idea is to check pre-trends. That is, we look at how employment changed over time before either unit was treated. Ideally, we would go back as far as we can historically and check that those trends are parallel to each other: if they are diverging (or converging), this is likely a problem.

Another issue, especially prevalent where the units being treated are geographically close, is “spillovers”. As the wage rises in New Jersey, one can imagine that workers switch from restaurants in Pennsylvania and commute to New Jersey—especially in bordering areas. But now Pennsylvania is also partly getting the treatment, and this will typically drive the difference between the states (and thus the estimated causal effect) towards zero. This is a problem for parallel trends, and it turns up as a SUTVA (Stable Unit Treatment Value Assumption) violation in the econometrics literature. SUTVA requires that one unit doesn’t “interfere” with another in treatment or outcome terms.

Lab 11: Logistic Regression

When dealing with a binary outcome, or more generally an outcome bounded between 0 and 1 (e.g., a probability or proportion), we use logistic regression.

Log Laws

For any positive \(a,b\) and base \(k>0, k\neq 1\):

\[ \log_k(ab) = \log_k a + \log_k b \]

\[ \log_k\left(\frac{a}{b}\right) = \log_k a - \log_k b \]

\[ \log_k(a^n) = n \log_k a \]

\[ \log_k a = \frac{\log_m a}{\log_m k} \]

\[ k^{\log_k a} = a \]
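These identities are easy to verify numerically in R. Here is a quick check of the product and power rules (any base works; base 2 is used for illustration):

```r
a <- 8; b <- 4
# product rule: log(ab) = log(a) + log(b)
isTRUE(all.equal(log2(a * b), log2(a) + log2(b)))  # TRUE
# power rule: log(a^n) = n * log(a)
isTRUE(all.equal(log2(a^3), 3 * log2(a)))          # TRUE
```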

Anti-log

\[ \text{antilog}_k(x) = k^x \]

If \(x = \log_k a\), then:

\[ a = k^x \]


Relationship Between \(\log\) and \(e\)

The natural logarithm uses base \(e\), where

\[ e \approx 2.71828. \]

The natural log is written as:

\[ \ln(x) = \log_e(x). \]

The natural logarithm and the exponential function \(e^x\) are inverses:

\[ \ln(e^x) = x, \qquad e^{\ln(x)} = x. \]

Conversion between natural log and log base 10:

\[ \log_{10}(x) = \frac{\ln(x)}{\ln(10)}, \qquad \ln(x) = \log_{10}(x)\,\ln(10). \]

Derivative identities (optional but standard):

\[ \frac{d}{dx}\ln(x) = \frac{1}{x}, \qquad \frac{d}{dx}e^x = e^x. \]

Odds and Odds Ratios

Let’s think in a bit more detail about probabilities. If we divide the probability of something happening by the probability of it not happening, we have the odds of that event. So the odds of drawing a Politics major from the class are:

\[ \frac{\Pr(y_i=\text{Politics})}{\Pr(y_i=\text{Econ})} = \frac{\Pr(y_i=\text{Politics})}{1-\Pr(y_i=\text{Politics})} \]

That is:
\[ \frac{0.64}{0.36} = \textbf{1.78} \]
Thus, “drawing a Politics major is 1.78 times as likely as drawing an Econ major” (which makes sense, because there are almost twice as many Politics majors).
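In R, the same odds calculation is:

```r
p_politics <- 0.64
odds <- p_politics / (1 - p_politics)  # Pr(event) / Pr(not event)
round(odds, 2)  # 1.78
```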

What are the odds of voting for those with salaries between $50k and $90k? We can work it out like this for these earners:

voting_data <- read.csv("data/voting_data.csv")
earners <- voting_data[which(voting_data$income<90 & voting_data$income>50),]
how_many_earners <- nrow(earners)
pr_voting <-  sum(earners$turnout == 1)/how_many_earners
pr_not_voting <- sum(earners$turnout == 0)/how_many_earners
odds <- pr_voting/pr_not_voting
print(odds)
## [1] 0.3050847

\[ \frac{\Pr(y_i=1 \mid 50 < x_i < 90)}{\Pr(y_i=0 \mid 50 < x_i < 90)} = \frac{72/308}{236/308} \] Because 72 of the 308 in that income band voted, and 236 did not. So, the odds of voting (for this group) is 0.305. They are about a third as likely to vote as not vote.

Alternatively, what are the odds of not voting? \[ \frac{236/308}{72/308} = 3.277 \]
So this group is a little over 3 times as likely not to vote as to vote. This might be obvious, but note we are dealing with ratios now: the odds of voting are the reciprocal of the odds of not voting (\(0.305 \approx 1/3.277\)), not \(1 - \text{(odds of not voting)}\). This is just not how odds are defined.

Suppose we also calculate odds of voting for those earning more than $90K:
\[ \frac{436/438}{2/438} = 218 \]
So this group is two hundred times as likely to vote as not vote: essentially everyone votes in this group.
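Both sets of odds can be computed directly from the counts in the text:

```r
# odds = Pr(vote) / Pr(not vote), computed from the raw counts
odds_low  <- (72/308) / (236/308)   # earners between $50k and $90k
odds_high <- (436/438) / (2/438)    # earners above $90k
round(odds_low, 3)   # 0.305
round(odds_high)     # 218
```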

Odds Ratio

We can compare the odds for the two groups. Let’s divide the odds of voting for the high income folks by the odds of voting in the low income group:

\[ \frac{218}{0.305} \approx 715 \]

This is called the odds ratio. Here it tells us that the odds of voting in the richer group are roughly 700 times greater than the odds in the poorer group. Note that this doesn’t mean the richer group has a probability of voting that’s 700 times larger. It means that their odds are about 700 times higher. The jury is out on whether this is helpful to report (!)


What we want: Logistic Regression

One thing we might want to know is how the probability of voting changes systematically as we alter variables on the right-hand side—like income. It would be great if we could just put our explanatory variables (our \(X\)s) with coefficients for their effects (the \(\beta\)s) on that right-hand side, like we do in the usual regression context. As we have seen, though, we can’t do it directly with \(y_i \in \{0,1\}\): this gives the LPM, and we said we might not want that set up.

So we need to put something else on the left-hand side (the place we would usually put \(Y\) in a linear regression). The binary outcome, voting or not voting, is still the dependent variable. But we need to find a transformation—a mathematical operation—of the dependent variable that allows us to do our regression in the way we want.

We use the log of the odds and we call it the logit of the probability. We can write the logit as:
\[ \text{logit}(\Pr(y_i = 1)) \quad \text{or} \quad \log \left( \frac{\Pr(y_i=1)}{1 - \Pr(y_i=1)} \right) \quad \text{or} \quad\text{logit}(p) \] where \(p\) stands in for \(\Pr(y_i = 1)\). We will make this \(\text{logit}(p)\) a linear function of our \(X\)s. So we’ll end up with something like:

\[ \text{logit}(p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n \] Fitting this model to data, such that we estimate all the \(\beta\)s is called logistic regression.

A Note on Logs

Logistic regression uses logs. A log, or logarithm, is a function that takes a number and returns the exponent to which the base must be raised to get that number.

For example, suppose we have the number 100. Suppose our base is 10. What number do we need to raise 10 to, to get 100? Well, 100 is \(10^{\color{blue}{2}}\), so the answer is 2.

print(10**2)
## [1] 100

Using base 10 is called the common log. Its antilogarithm requires you to raise 10 to that number: if you give me 2, I can “get back” to the original number by computing \(10^2 = 100\). In R you can take the log of any positive number to any base you want:

log(100, base=2)
## [1] 6.643856

It turns out that a more popular base in statistics is \(e\) (“Euler’s number”). Here, \(e \approx 2.7\). If we use that base, we are taking the natural log. It’s so popular that most statistical packages use it by default when you call log().

log(100, base= 2)
## [1] 6.643856
log(100)
## [1] 4.60517
log(100, base= exp(1)) # euler's number is generated as e^1 or exp(1)
## [1] 4.60517

Its antilogarithm is computed by taking \(e\) to that number. So for example

  • The antilog of 3 is \(e^3 = 20.09\)
  • The antilog of -1.5 is \(e^{-1.5} = 0.223\)
exp(log(100, base= exp(1))) # this is taking the natural antilog of a natural log
## [1] 100

Note: When we take \(e\) to a number \(n\) (\(e^n\)), the result is always greater than zero, though it can be less than 1 if the exponent is negative.

Working with logit

Remember the main logistic regression equation: \[ \text{logit}(p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n. \]

It may not be obvious, but this set up forces the resulting probabilities to always be between zero and one. That’s what we want. The way we estimate this model is not the same as with the linear model. And we won’t interpret the results the same way, either.
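To see why the probabilities stay in bounds, note that inverting the logit gives \(p = \exp(\text{linear predictor})/(1+\exp(\text{linear predictor}))\), which maps any real number into \((0,1)\):

```r
# inverse logit: maps any real-valued linear predictor into (0, 1)
inv_logit <- function(z) exp(z) / (1 + exp(z))
p <- inv_logit(c(-10, 0, 10))
p                   # close to 0, exactly 0.5, close to 1
all(p > 0 & p < 1)  # TRUE
```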

Estimating a logistic regression

In R, we use the glm() function to estimate a logistic regression (sometimes called a “logit”). The name glm stands for generalized linear model; a logistic regression is one example of a GLM. In fact, the linear model we have been using up to now is also a special case of a GLM. The way we define GLMs is technical and specific, but one important part is the idea of a “linear predictor”—this means you will find that \(Y\) (your outcome) somehow depends on \(\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n\), as in linear regression.

Because there are lots of GLMs, you need to give R specific instructions as to which you want it to fit. Here is the syntax for the logistic regression of turnout on income and renting (i.e. whether you are a renter or someone who owns your property):

glm(formula = turnout ~ income + renter, family = binomial(link = "logit"), 
    data = voting_data)

The family argument tells R you want a logistic regression, specifically.

Interpreting a Logistic Regression

As with lm we can ask for the regression table:

logit_model <-glm(formula = turnout ~ income + renter, family = binomial(link = "logit"), 
    data = voting_data)
summary(logit_model)
## 
## Call:
## glm(formula = turnout ~ income + renter, family = binomial(link = "logit"), 
##     data = voting_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -23.22313    2.64534  -8.779   <2e-16 ***
## income        0.29419    0.03306   8.898   <2e-16 ***
## renter       -0.95420    0.41140  -2.319   0.0204 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.04  on 999  degrees of freedom
## Residual deviance:  158.43  on 997  degrees of freedom
## AIC: 164.43
## 
## Number of Fisher Scoring iterations: 9

Raw Coefficients

Let’s take a closer look at the logit coefficients.

require(knitr)
knitr::kable(summary(logit_model)$coefficients, caption = "")
                Estimate  Std. Error    z value   Pr(>|z|)
(Intercept)  -23.2231315   2.6453436  -8.778871  0.0000000
income         0.2941886   0.0330614   8.898247  0.0000000
renter        -0.9542047   0.4113996  -2.319411  0.0203728

The Intercept tells us the log-odds of voting when the \(X\)s take a zero value.

\(\hat{\beta}\) is “the effect of a one unit change in \(X\) on the log odds”

  • So if our \(\hat{\beta}\) on income is 0.29, we say that a one thousand dollar increase in income increases the log-odds of voting by 0.29, holding all other variables constant.
  • If our \(\hat{\beta}\) on renter is -0.95, we say that renting (as opposed to buying) decreases the log-odds of voting by 0.95, holding all other variables constant.

We can also take \(e^{\hat{\beta}}\) and:

  • if our \(\hat{\beta}\) on income is 0.29, we say that a one thousand dollar increase in income multiplies the odds of voting by \(\exp(\hat{\beta})=\exp(0.29)\), which is 1.3364275, holding all other variables constant.
  • if our \(\hat{\beta}\) on renter is -0.95, we say that renting (as opposed to buying) multiplies the odds of voting by \(\exp(-0.95) = 0.386741\), holding all other variables constant.

Neither of these is typically very helpful so we usually just note that:

  • if our \(\hat{\beta}\) is positive, this means this variable has a positive effect on the probability of voting (if it increases, we are more likely to vote)
  • if our \(\hat{\beta}\) is negative, this means this variable has a negative effect on the probability of voting (if it increases, we are less likely to vote)

A difference between the linear model and the logit is that the latter uses \(Z\) values rather than \(t\) values in the summary table. This is due to the method by which the logit is fit to the data, which is not ordinary least squares. But for our purposes you can interpret the \(p\)-value in statistical significance terms as you would there. So, here, both variables are statistically significant predictors of the outcome.
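As a quick check, the odds multipliers quoted above come from exponentiating the rounded coefficients:

```r
# odds-ratio scale: exponentiate the (rounded) logit coefficients
exp(c(income = 0.29, renter = -0.95))
#    income    renter
# 1.3364275 0.3867410
```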

Predicted Probability

We might also want to know the predicted probability of an outcome, contingent on the particular characteristics a given observation has. First, let’s get the predicted probability for everyone currently in the data. This is obtained by

predicted <- predict(logit_model, type = "response")

If we want to look at a specific person already in the data, that’s easy. Here’s observation number 34 and our prediction for them:

print(voting_data[34,])
##     X turnout   income renter
## 34 34       1 122.3881      0
cat("\n probability of voting ==",predicted[34],"\n")
## 
##  probability of voting == 0.9999972

They are a high earner and a house-owner, so we predict they will turn out with a very high probability: 0.99. And, in fact, they did turn out in the real data. But it’s more interesting to think about people not in the data. For example, suppose a person earned $55k a year and was a renter: what would we predict their probability of voting to be?

new_person <- data.frame(income=55, renter=1)
predict(logit_model, newdata = new_person, type = "response")
##            1 
## 0.0003363701

That’s pretty unlikely! What about someone who earns $83k but rents? We can work this out by hand. First, it turns out that \[ \Pr(y_i=1) = \frac{\text{odds}}{1+\text{odds}} = \frac{\exp(\hat{\beta} X)}{1+\exp(\hat{\beta} X)} \] The log odds for someone with this profile is \[ \text{intercept} + \beta_1\times X_1 + \beta_2 \times X_2 \]

which is \[ - 23.22 + 0.294\times83 -0.95\times 1 = 0.232. \] We can then get the odds by taking \(\exp(\hat{\beta}X) = e^{0.232} =1.26\). In which case, the probability is \[ \Pr(y_i=1) = \frac{1.26}{1+1.26} = 0.56 \] In R we do:

new_rich_person <- data.frame(income=83, renter=1)
predict(logit_model, newdata = new_rich_person, type = "response")
##         1 
## 0.5597923
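The hand calculation can also be written out in R with the rounded coefficients (small rounding differences from the predict() value are expected):

```r
# predicted probability by hand: log-odds -> odds -> probability
log_odds <- -23.22 + 0.294 * 83 - 0.95 * 1  # intercept + b1*income + b2*renter
odds <- exp(log_odds)
round(odds / (1 + odds), 2)  # 0.56
```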

Marginal Effects

With the linear model, the marginal effect of a specific \(X\) on \(Y\)—that is, what happens to \(Y\) as we change that \(X\), keeping all other variables constant—was just the \(\hat{\beta}\) on that \(X\). It didn’t matter how many variables were in the model, or what values they took.

Things are not so simple with logistic regression, because the marginal effect of a given \(X\) changes depending on the values the other variables take. For example, the effect of renting (going from renter = 0 to renter = 1) on turnout probability can be larger or smaller depending on the value of income.

To get a sense of this problem, consider the following code. Here, we move income from $10k a year to $200k, and see what happens to turnout probability. And we do this for people who are renters (in blue) and buyers (in red). In both groups, as income increases, the probability of turnout increases. But the effect is not the same for every level of income.

increasing_income_renter <- data.frame(income=seq(10,200), renter=1)
increasing_income_buyer <- data.frame(income=seq(10,200), renter=0)

renters <- predict(logit_model, newdata = increasing_income_renter, type = "response")
buyers <- predict(logit_model, newdata = increasing_income_buyer, type = "response")

plot(seq(10,200), renters, xlab="income", col="lightblue", type="b", 
     pch=16, xlim=c(50,100), ylab="Pr(turnout)")
points(seq(10,200), buyers, col="red", type="b", pch=16)
grid()
legend("topleft", pch=c(16, 16), col=c("lightblue", "red"), legend=c("renters", "buyers"))

For example, look at the turnout probability when income is $60k: it doesn’t make much difference whether one is a renter or buyer. There’s no gap between the groups, implying renting v buying has no effect. What about at $80k? Here the gap is quite large, with buyers almost twice as likely to vote. But this just makes our point: the effect of renting differs depending on what income group we are talking about. So the effect is not constant. Note, as an aside, how—unlike the LPM—the probability is not linear in the covariate (here income) for either case: the probability curves smoothly (and is always between 0 and 1).

Consequently, we might like a systematic way to summarize the effects of the variables across these sorts of differences. There are lots of different options here, but we’ll look at the most popular option. First, we need the margins package.

We will calculate Average Marginal Effects (AME)—this asks “On average, how does a one-unit change in \(x\) affect the probability of turning out in the data?”

For a given person in the dataset, we first calculate the effect of changing \(x\) (say, income) by one unit, and see what happens to their probability of voting. Then we do the same for every other person in the data set and, at the end, take the average of all these individual effects.

For our data set, this is

library(margins)
summary(margins(logit_model))
##  factor     AME     SE       z      p   lower   upper
##  income  0.0071 0.0001 66.9271 0.0000  0.0069  0.0073
##  renter -0.0229 0.0096 -2.3977 0.0165 -0.0417 -0.0042

This says that, for income, a one unit ($1000) increase in income increases the probability of voting by 0.0071, on average. Meanwhile, switching from buying (renter = 0) to renting (renter = 1) reduces the probability of voting by 0.0229, on average. This doesn’t mean the effect is 0.0071 (or -0.0229) everywhere for everyone, just that this is the average value.
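To see what margins() is doing under the hood for a binary variable, here is the AME recipe computed by hand on simulated data (the data and variable names here are made up for illustration):

```r
# simulated data with a known negative effect of the binary variable z
set.seed(1)
df <- data.frame(x = rnorm(500), z = rbinom(500, 1, 0.5))
df$y <- rbinom(500, 1, plogis(0.5 * df$x - 0.8 * df$z))
m <- glm(y ~ x + z, family = binomial, data = df)

# AME of z: set z to 1 and then to 0 for everyone, average the difference
d1 <- d0 <- df
d1$z <- 1
d0$z <- 0
ame_z <- mean(predict(m, newdata = d1, type = "response") -
              predict(m, newdata = d0, type = "response"))
ame_z  # negative, since the true effect of z is negative
```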

Panel Data

Fixed Effects

The idea now is to introduce a different intercept for each person. There are two main ways to proceed, and we will start with the simplest, which is called “fixed effects”—which connotes the (correct) idea that we are dealing with something that is non-random and fixed (for the unit) over time. Other names include the “within estimator” and the “individual dummy model”.

This latter term captures the intuition that it is as if we are adding a dummy variable for each person or unit. To recap, a dummy variable is a variable that can only take two values: zero or one. In the Snow case, there will be variables that:

  • take the value of “1” if the unit we are dealing with is 1 Broad Street, and zero otherwise—that is, all other units get “0” on this variable.
  • take the value of “1” if the unit we are dealing with is 3 Broad Street, and zero otherwise—that is, all other units get “0” on this variable.
  • take the value of “1” if the unit we are dealing with is 5 Broad Street, and zero otherwise—that is, all other units get “0” on this variable.
  • take the value of “1” if the unit we are dealing with is 1 Wick Street, and zero otherwise—that is, all other units get “0” on this variable.
  • …etc etc

…for every household except the last one, which (for various technical reasons) doesn’t need its own dummy if everyone else has one. Ultimately, this adds \(n-1\) variables to the dataset where, recall, \(n\) is the number of units (households or people or countries).

More importantly, adding these dummies wipes out the effect of any other variable that does not change over time in our data: it will just disappear from our regression. This makes sense: if some background factor doesn’t change, it can’t itself have caused any changes in \(Y\). That is, you cannot explain something that’s changing with something that’s constant, so let’s just scrub it. Fixed effects let you do that, and it doesn’t matter whether the background covariate is observed or unobserved.

For example, in our survey above, we think “household as a child” is not changing over time. So whatever effect it has is included in a person’s intercept term, and that’s it. Similarly, “political culture” is not changing (much!) over time. So whatever effect it has on autocracy v democracy should be included in the country’s intercept term, and that’s it.

Fitting Fixed Effects

First, let’s do exactly what we suggested above: just add a dummy variable for every unit. In the standard lm setup, this would be equivalent to:

gop_panel_data = read.csv("data/gop_panel_data.csv")
big_diff = read.csv("data/big_diff.csv")
# with base R/lm -- no special packages
fe_base<- lm(thermometer ~ income + factor(respondent), data = gop_panel_data)
summary(fe_base)$coef
##                      Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)         27.469569 0.36241826 75.79521 5.355233e-33
## income               0.145873 0.00340183 42.88073 2.249311e-26
## factor(respondent)2  4.528194 0.44995106 10.06375 1.238668e-10
## factor(respondent)3  4.693291 0.44995106 10.43067 5.702166e-11
## factor(respondent)4  8.329712 0.44995106 18.51248 7.157871e-17
## factor(respondent)5 13.536973 0.44995106 30.08544 2.652356e-22
## factor(respondent)6 16.346578 0.44995106 36.32968 1.836325e-24
## factor(respondent)7 31.238796 0.44995105 69.42710 5.659064e-32

The factor(respondent) term just tells lm to treat each different numerical entry in respondent as if it’s a different dummy variable. Actually, what’s really happening here is that we are adding a single variable that takes a different categorical value depending on the respondent, but the effect is the same.

It turns out, though, that there is a more efficient—in a computational sense—way to fit fixed effects. This involves transforming the data by subtracting each unit’s mean from its observations (the “within” transformation) and then running the relevant linear regression. Packages like fixest can do this for you with a simple call like this:

# Fixed effects: unit-specific intercepts
fe_model <- feols(thermometer ~ income | respondent, data = gop_panel_data)
summary(fe_model)
## OLS estimation, Dep. Var.: thermometer
## Observations: 35
## Fixed-effects: respondent: 7
## Standard-errors: Clustered (respondent) 
##        Estimate Std. Error t value  Pr(>|t|)    
## income 0.145873   0.004104 35.5405 3.308e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.624861     Adj. R2: 0.995917
##                  Within R2: 0.985529
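As a check that the demeaning trick really is equivalent to the dummy-variable approach, here is a small simulated example (made-up data) showing the within estimator reproduces the dummy-variable estimate of the slope:

```r
# simulated panel: 10 units, 5 periods each, unit-specific intercepts
set.seed(2)
panel <- data.frame(id = rep(1:10, each = 5), x = rnorm(50))
panel$y <- 2 * panel$x + rep(rnorm(10, sd = 5), each = 5) + rnorm(50)

# (1) dummy-variable ("LSDV") fit
b_lsdv <- coef(lm(y ~ x + factor(id), data = panel))["x"]

# (2) within fit: subtract each unit's mean, then OLS with no intercept
panel$y_w <- panel$y - ave(panel$y, panel$id)
panel$x_w <- panel$x - ave(panel$x, panel$id)
b_within <- coef(lm(y_w ~ x_w - 1, data = panel))["x_w"]

all.equal(unname(b_lsdv), unname(b_within))  # TRUE
```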

To make the point that fixed effects wipe out time-invariant characteristics, let’s compare the pooled model and the fixed effects model with an extra covariate, union_household:

# Fixed effects: unit-specific intercepts
pool2 <- lm(thermometer  ~ income + union_household, data=gop_panel_data)
summary(pool2)$coef
##                    Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      42.2977167  2.5890892 16.336910 4.396962e-17
## income            0.1458911  0.0392494  3.717026 7.699759e-04
## union_household -12.5649731  3.0712821 -4.091117 2.709407e-04

Taken at face value, this implies being from a union_household has a negative effect. For the fixed effects model we would use:

fe_model2 <- feols(thermometer ~ income + union_household | respondent, data = gop_panel_data)
## The variable 'union_household' has been removed because of collinearity (see $collin.var).

Note the warning: this is telling us that, once we use fixed effects, this variable must be dropped. And finally:

summary(fe_model2)
## OLS estimation, Dep. Var.: thermometer
## Observations: 35
## Fixed-effects: respondent: 7
## Standard-errors: Clustered (respondent) 
##        Estimate Std. Error t value  Pr(>|t|)    
## income 0.145873   0.004104 35.5405 3.308e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.624861     Adj. R2: 0.995917
##                  Within R2: 0.985529

Interpreting Fixed Effects

Here, using fixed effects made a very small difference to the coefficient on income relative to using a pooled model. But that isn’t always the case. Consider this set of data:

bd <- read.csv("data/big_diff.csv")

This is (simulated) data on people getting specialty training (like certificates in software support, or a masters in some complex area of law), and how that affects their net worth. The data has been created such that the overall correlation between specialty training and net worth isn’t that large, but for a given individual, more specialty training boosts their net worth a lot, especially if they start out relatively poor.

This sort of pattern could emerge because, say, already wealthy people typically select into extra specialty training for prestige reasons; meanwhile, the people for whom it really makes a difference are those who don’t typically recognize its value or cannot take a financial risk on obtaining it.

Now, the true causal relationship between \(X\) and \(Y\) is always positive. However, the level of \(X\) is highly correlated with unobserved characteristics of the students: richer people who don’t gain much from it are nonetheless more likely to select into it. The pooled model mistakenly attributes variation in \(Y\) that is due to unchanging unobserved differences across individuals to the effect of \(X\).

pool3 <- lm(net_worth  ~ specialty_training, data=bd)
summary(pool3)$coef
##                     Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)        10.558034  0.2521295 41.875449 2.813052e-165
## specialty_training  1.576269  0.1622599  9.714469  1.519798e-20

But that makes it look like the return to extra specialization isn’t very high: the people who get big doses of \(X\) are already rich, so we underestimate the return to this education. Compare this to:

fe_model3 <- feols(net_worth ~ specialty_training | id, data = bd)
summary(fe_model3)$coeftable
##                    Estimate Std. Error  t value     Pr(>|t|)
## specialty_training 2.018309 0.04514054 44.71167 1.847182e-67
## attr(,"type")
## [1] "Clustered (id)"

Now the coefficient—the return to education—is much larger. This is correctly adjusting for fixed background characteristics we didn’t get to observe.
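One way to see what the fixed effects estimator is doing is to apply the “within” transformation by hand: demean \(Y\) and \(X\) inside each unit, then run plain OLS on the demeaned data. A minimal sketch in base R, assuming the id, net_worth, and specialty_training columns from the example above:

```r
# Within transformation by hand: subtract each unit's mean from Y and X,
# then run OLS on the demeaned variables. The slope should reproduce the
# fixed effects estimate from feols() above (about 2.018).
bd$nw_demeaned <- bd$net_worth - ave(bd$net_worth, bd$id)
bd$st_demeaned <- bd$specialty_training - ave(bd$specialty_training, bd$id)

coef(lm(nw_demeaned ~ st_demeaned, data = bd))["st_demeaned"]
```

This is exactly why the fixed effects slope only uses variation within units: anything constant within a unit is wiped out by the demeaning.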

Depending on the software, one might get a value for all the different intercepts or, more usually, a single intercept and then the deviations from that common value for the different units. For the example above, you could do

fixef(fe_model2)
## $respondent
##        1        2        3        4        5        6        7 
## 27.46957 31.99776 32.16286 35.79928 41.00654 43.81615 58.70836 
## 
## attr(,"class")
## [1] "fixest.fixef" "list"        
## attr(,"exponential")
## [1] FALSE

In any case though, notice that researchers don’t usually display the values of the estimated intercepts. The idea is mostly to make sure we have controlled for them. As usual, the slope coefficient tells you

How much the outcome \(Y\) changes within a unit, on average, when \(X\) changes by one unit, holding constant all time-invariant characteristics of that unit.

This differs from the usual pooled or cross-sectional case, because we are now talking explicitly about changes within a unit, by which we mean “following that unit over time and comparing it with itself.”

Random Effects

So-called random effects are an alternative to fixed effects, insofar as they also allow one to control for unobserved heterogeneity in panel data. The underlying methods are technical, but the general idea suffices: we have a common intercept for everyone (all the units), but assume each person (each unit) has a random term idiosyncratic to them that moves them up or down relative to the shared constant.

Of course, as usual in regression, everyone has a random error term that can take different (residual) values in different time periods: \(\epsilon_{it}\). So it is helpful to think of the random effects model as having a kind of composite error term: part of an individual’s variation comes from pre-existing, unobserved but systematic factors that make them different from the average, and part is just “random” error.
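Formally, you can write the random effects model with this composite error made explicit:

\[ Y_{it} = \alpha + \beta X_{it} + \underbrace{u_i + \epsilon_{it}}_{\text{composite error}} \]

where \(u_i\) is the unit-specific random term (assumed uncorrelated with \(X_{it}\)) and \(\epsilon_{it}\) is the usual idiosyncratic error.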

This is conceptually more involved, but random effects models have nice features: for one thing, they are more “efficient,” meaning you can get more precise estimates of the coefficients. In addition, when you use random effects you don’t completely swallow the effects of variables that don’t change over time—like “child union household” in our previous example.

So why not always use them? As usual, it is because we need to make other—often less plausible—assumptions. Specifically, we need to believe that there is zero correlation between an individual’s personal unit effect (their intercept) and anything in the model—that is, any \(X\)s we choose to put in the regression. The reason for this is technical; essentially the problem is if there is a correlation, then the error term of the regression contains confounders that should have been controlled for in the regression.

In social science, this assumption is very commonly violated: we often think there is something “special” about individuals that comes from things we would like to control for that affect \(X\) and \(Y\), such as family background or health. So you tend to see random effects more in the natural or biological sciences, or when people are running experiments rather than working with observational data. In any case, there are methods like the Hausman test to tell you when random effects might be a reasonable choice.

Here is our previous regression using fixed effects, then random effects; we’ll use the plm package for this:

Using plm()
bd <- read.csv("data/big_diff.csv")

library(plm)

# declare panel structure
pdata <- pdata.frame(bd, index = "id")
## Warning in pdata.frame(bd, index = "id"): column 'time' overwritten by time
## index
# Random fx
re_model <- plm(net_worth ~ specialty_training, data = pdata, model = "random")

summary(re_model)$coef
##                     Estimate Std. Error  z-value     Pr(>|z|)
## (Intercept)        10.528487 0.55942179 18.82030 5.149103e-79
## specialty_training  2.009547 0.05073551 39.60829 0.000000e+00

Now the fixed effects version:

# Fixed fx
fe_model <- plm(net_worth ~ specialty_training, data = pdata, model = "within")
summary(fe_model)$coef
##                    Estimate Std. Error  t-value      Pr(>|t|)
## specialty_training 2.018309 0.05102448 39.55569 3.783626e-140

Finally, the Hausman test:

phtest(fe_model, re_model)
## 
##  Hausman Test
## 
## data:  net_worth ~ specialty_training
## chisq = 2.6109, df = 1, p-value = 0.1061
## alternative hypothesis: one model is inconsistent

Here, \(p>0.05\) (i.e. no significant difference between the models), which implies the random effects model may be reasonable.

Time Fixed Effects

So far we’ve been thinking about unit (fixed or random) effects. These took care of \(X\)s we thought didn’t vary over time (like child household nature). By contrast, time fixed effects take care of \(X\)s we think don’t vary across units, but affect all units equally.

For example, you might think that there was something special about people’s vote choices in 2024 (e.g. Trump was a former incumbent running for President, but not currently president), and so you might want to include a “special effect” for that year to check your general conclusions about income and vote choice are correct. Or you might want to include a time fixed effect for whenever (in their lifetimes) people graduated college if this, say, may have changed their views on political issues.

“Two-way” Fixed effects

As with unit fixed effects, the essence of the technique is to create a kind of dummy variable for every time period. These can be doubled up with unit fixed effects to form a two-way fixed effects (TWFE) model, though this requires substantial variation in the data to be estimable. For instance, if everyone gets the treatment in the same year, it will be tough to separate the treatment effect from a time effect.

For completeness, this is what a two-way fixed effects model looks like in plm:

# declare panel structure
ptdata <- pdata.frame(bd, index = c("id", "time"))

# Two-way fixed effects model (unit + time)
mod_twfe <- plm(net_worth ~ specialty_training + factor(time), data = ptdata, model = "within")
summary(mod_twfe)$coef
##                       Estimate Std. Error    t-value      Pr(>|t|)
## specialty_training  2.01648657 0.05141831 39.2172850 2.818783e-138
## factor(time)2      -0.02690848 0.14084521 -0.1910501  8.485845e-01
## factor(time)3       0.06149485 0.14098355  0.4361846  6.629411e-01
## factor(time)4       0.05019158 0.14084746  0.3563542  7.217656e-01
## factor(time)5       0.07970746 0.14102209  0.5652126  5.722500e-01

Standard Errors

Panel data can be tricky because the observations are not independent, either over time within one unit or across units. Yet many techniques for estimating exactly how noisy our slope estimates are assume such independence. For example, we can imagine that a voter’s attitudes are correlated with themselves over time. Or we could imagine that a financial crisis (in a given year) may induce cross-country correlation as, say, British consumers stop buying French products, which affects both countries’ GDPs. We won’t belabor these issues in this course, except to say that there are various corrections for such problems, including clustered and heteroskedasticity-robust standard errors.

Note the maintained conceptual difference here: fixed or random effects change the point estimates— the \(\hat{\beta}\)s. Meanwhile, standard errors—robust or otherwise—are all about the variance of those estimates: that is, how noisy they are, and by extension whether we can say a given \(\hat{\beta}\) is statistically significant from zero.
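As a sketch of how such corrections look in practice with fixest (which, as seen above, clusters on the fixed-effect variable by default), compare clustered with heteroskedasticity-robust standard errors on the bd data from earlier. Note that only the standard errors change, not the \(\hat{\beta}\)s:

```r
library(fixest)

# Same point estimate, different variance estimates.
fe_clustered <- feols(net_worth ~ specialty_training | id, data = bd)  # clustered by id (default)
fe_robust    <- feols(net_worth ~ specialty_training | id, data = bd,
                      vcov = "hetero")                                 # robust, not clustered

coeftable(fe_clustered)
coeftable(fe_robust)
```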

Lab 12: Text as Data

Key Vocabulary

  • Corpus: A structured collection of documents (plural: corpora).

  • Document: A single unit of text (e.g., a tweet, article, or speech).

  • Metadata: “Data about data.” For example, the author, date of writing, and program used to produce a letter might be metadata of interest, separate from the words themselves.

  • Token: A single instance of a word or term within a document.

  • Type: A unique word or term observed in the corpus.

  • Stop Words: Commonly used terms—such as “the,” “and,” “of”—that contribute little analytical value and are often excluded to reduce noise. Removing stop words can simplify text while retaining substantive meaning.

  • Punctuation & Capitalization: Without preprocessing, “run”, “Run”, “run.”, and “run,” would be treated as distinct types. Standardizing by removing punctuation and lowercasing reduces unnecessary variation when the goal is to capture the frequency of the underlying term.

  • Tokenization: The process of segmenting text into units (tokens) according to a rule, typically whitespace-based.

  • Tagging: Assigning tokens to linguistic categories (e.g., noun, adjective, verb) to provide grammatical structure (part-of-speech tagging).

  • Stemming: A rule-based procedure that reduces words to root forms by truncating prefixes or suffixes (e.g., removing “-ing” or “-ed”). Stemming may produce nonwords and can be overly aggressive (e.g., spring → spr).

  • Lemmatization: A dictionary-based normalization method that maps inflected forms to canonical forms while accounting for context. More accurate but computationally costlier than stemming.

  • Bag-of-Words (BoW): A representation that ignores word order and retains only token counts. Although this sacrifices syntactic information, it often preserves sufficient signal for many analytical tasks.

  • Vector Space Representation: Encodes each document as a vector in a high-dimensional space, where dimensions correspond to unique terms. Cell values represent term frequency or importance.

    This yields a Document-Term Matrix (DTM) or Document-Feature Matrix (DFM), with documents as rows and terms as columns, populated by unigram counts.

  • Euclidean Distance: The straight-line distance between two vectors.

  • Cosine Similarity: A similarity metric based on the cosine of the angle between document vectors. It effectively normalizes token counts by accounting for document length.

  • Term Frequency (TF): The number of occurrences of a word within a document.

  • Inverse Document Frequency (IDF): Typically computed as:

\[ \log\left(\frac{N}{n_w}\right) \]

where \(N\) is the total number of documents and \(n_w\) is the number of documents containing word \(w\). This downweights common words that appear in many documents.
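As a quick arithmetic check of the formula, here is the IDF weight computed by hand (natural log) for a corpus of \(N = 4\) documents and a word appearing in \(n_w = 1\) of them:

```r
# IDF by hand, using the natural log as in the formula above.
N   <- 4  # total number of documents
n_w <- 1  # documents containing the word
log(N / n_w)  # ≈ 1.386; a word appearing in every document gets log(4/4) = 0
```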

Steps

  1. Make a corpus
  2. Preprocess: tokenize, lowercase, remove punctuation/numbers/stopwords, stem/lemmatize
  3. Create a DTM/DFM
  4. Compute distances/similarities
doc1 <- c("The nonpartisan Congressional Budget Office said Wednesday that the broad Republican bill to cut taxes and slash some federal programs would add $2.4 trillion to the already soaring national debt over the next decade, in an analysis that was all but certain to inflame concerns that President Trump’s domestic agenda would lead to excessive government borrowing.")

doc2 <- "At the same time, Republicans have targeted programs that help low-income Americans for cuts. Under the House-passed version of the legislation, nearly 11 million Americans are expected to lose health coverage, for example. The loss in those benefits would overwhelm the modest savings that Americans on the lower rungs of the income ladder would see from tax cuts.

Even with the cuts to safety-net programs, the legislation would still be costly, with the budget office previously estimating that it would add nearly $3 trillion to the debt over the next decade, including additional borrowing costs."

our_corpus <- corpus(c(doc1, doc2))
# lowercase
tokens_lower <- tolower(our_corpus)

# remove punctuation and numbers
tokens_cleaned <- tokens(tokens_lower, remove_punct = TRUE, remove_numbers = TRUE)

# remove stop words
tokens_no_stops <-  tokens_remove(tokens_cleaned, stopwords("en")) 
#stopwords("en") --check what are the stopwords

# stem it
stemmed_corpus  <-tokens_wordstem(tokens_no_stops, language = "english")

# make it into a dfm:
corpus_dfm <- dfm(stemmed_corpus)

corpus_dfm
## Document-feature matrix of: 2 documents, 68 features (40.44% sparse) and 0 docvars.
##        features
## docs    nonpartisan congression budget offic said wednesday broad republican
##   text1           1           1      1     1    1         1     1          1
##   text2           0           0      1     1    0         0     0          1
##        features
## docs    bill cut
##   text1    1   1
##   text2    0   3
## [ reached max_nfeat ... 58 more features ]
# Key Words in Context
kwic(tokens_cleaned, pattern = "budget", window=2) #window specifies how many words to show before and after the keyword
## Keyword-in-context with 2 matches.                                                                   
##   [text1, 4] nonpartisan congressional | budget | office said      
##  [text2, 73]                  with the | budget | office previously
# Compute Euclidean distance
euclidean_dist <- textstat_dist(corpus_dfm, method = "euclidean")

# Compute Cosine similarity
cosine_sim <- textstat_simil(corpus_dfm, method = "cosine")
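To connect these functions back to the definitions, both quantities can be reproduced by hand from the dfm’s count vectors (a sketch, using the corpus_dfm object created above):

```r
# Extract the two documents' count vectors from the dfm.
a <- as.numeric(corpus_dfm[1, ])
b <- as.numeric(corpus_dfm[2, ])

# Euclidean distance: straight-line distance between the vectors.
sqrt(sum((a - b)^2))

# Cosine similarity: cosine of the angle between the vectors,
# which normalizes for document length.
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
```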

TF-IDF

# Subset FDR's inaugural addresses (1933, 1937, 1941, 1945)
fdr_corpus <- corpus_subset(data_corpus_inaugural, 
                            President == "Roosevelt" & Year %in% c(1933, 1937, 1941, 1945))

# Tokenize and create a document-feature matrix (dfm)
dfm_raw <- dfm(tokens_tolower(tokens(fdr_corpus, remove_punct = TRUE)))

dfm_raw[,"will"]
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 4 docvars.
##                 features
## docs             will
##   1933-Roosevelt   10
##   1937-Roosevelt   15
##   1941-Roosevelt    4
##   1945-Roosevelt    7
dfm_raw[,"expect"]
## Document-feature matrix of: 4 documents, 1 feature (75.00% sparse) and 4 docvars.
##                 features
## docs             expect
##   1933-Roosevelt      1
##   1937-Roosevelt      0
##   1941-Roosevelt      0
##   1945-Roosevelt      0
# Apply tf-idf weighting
dfm_tfidf <- dfm_tfidf(dfm_raw, base= exp(1))

# then, compare: 
dfm_tfidf[,"will"]
## Document-feature matrix of: 4 documents, 1 feature (0.00% sparse) and 4 docvars.
##                 features
## docs             will
##   1933-Roosevelt    0
##   1937-Roosevelt    0
##   1941-Roosevelt    0
##   1945-Roosevelt    0
dfm_tfidf[,"expect"]
## Document-feature matrix of: 4 documents, 1 feature (75.00% sparse) and 4 docvars.
##                 features
## docs               expect
##   1933-Roosevelt 1.386294
##   1937-Roosevelt 0       
##   1941-Roosevelt 0       
##   1945-Roosevelt 0

Using Bayes Theorem

In any case though, the thing we’ve calculated is not the thing we actually want. We have calculated \(\Pr(d|c)\): the probability of seeing this collection of terms in an email, given that the email is spam. But we don’t know whether the email is spam—we are trying to determine whether it is, so we can decide whether to filter it out.

So what we actually want is \(\Pr(c|d)\): the probability that this document is spam given the words it contains. Fortunately we have a machine that can take us from \(\Pr(d|c)\) to \(\Pr(c|d)\), and that machine is called Bayes’ Theorem (or Bayes’ Rule, or Bayes’ Law).

Generically for any conditional probability we know that

\[ \Pr(A|B) = \frac{\Pr(A,B)}{\Pr(B)} \] In words: the probability of event \(A\) occurring given that event \(B\) has occurred is the probability of \(A\) and \(B\) occurring, divided by the probability of (just) event \(B\) occurring. A much more useful way to express this, however, is:

\[ \Pr(A|B) = \frac{\Pr(A)\Pr(B|A)}{\Pr(B)} \]

Where

  • \(\Pr(A)\) is the probability of \(A\) occurring, called our prior on \(A\)
  • \(\Pr(B|A)\) is the probability of \(B\) occurring given \(A\) occurred. This is sometimes called the likelihood of the data.

Notice that we now have \(\Pr(A|B)\) expressed in terms of \(\Pr(B|A)\). But that means we can express \(\Pr(c|d)\) in terms of \(\Pr(d|c)\) which is what we wanted. So:

\[ \Pr(c|d)=\frac{\Pr(c)\Pr(d|c)}{\Pr(d)} \]

But we can actually simplify even further. Notice that the denominator, \(\Pr(d)\) doesn’t affect whether a given value of \(c\) is more or less plausible. The denominator doesn’t involve \(c\) at all. Because of this, we can drop the denominator altogether, and write

\[ \Pr(c|d) \propto \Pr(c)\Pr(d|c) \] The \(\propto\) symbol means “proportional to”. It’s telling us that when the quantity on the right-hand side increases, the thing we care about—\(\Pr(c|d)\)—is larger. When the quantity on the right-hand side decreases, the thing we care about—\(\Pr(c|d)\)—is also lower. It’s not telling us exactly how much higher or lower it is, but we’ll see we don’t need that.

Just substituting in the terms we had earlier, we now see that:

\[ \Pr(c|d)\propto\underbrace{\Pr(c)}_{\mathrm{prior}}\;\;\underbrace{\prod^{K}_{k=1}\Pr(t_k|c)}_{\mathrm{likelihood}} \]

Using the Naive Bayes Classifier

We want to classify new data (emails), based on patterns we observe in our training set (which we will classify by hand). For example, we might look at 10,000 emails to this point in time, and classify them as being in spam or ham. That is, \(c\in\{\texttt{spam},\texttt{ham}\}\). We use that information, and the terms associated with the two classes, to categorize tomorrow’s email.

In particular, we typically want to assign the document—the email— to a single best class. We have a special name for this ‘best’ class. It is the maximum a posteriori class, or \(c_{map}\). It is whatever class leads to the highest \(\Pr(c|d)\) for this particular email in front of us. Intuitively, it is the answer to the question “of the two classes, which seems most plausible as a place to assign this email?”

Where do we get the various quantities we need? Well, we want

\[ \Pr(c|d)\propto\underbrace{\Pr(c)}_{\mathrm{prior}}\;\;\underbrace{\prod^{K}_{k=1}\Pr(t_k|c)}_{\mathrm{likelihood}} \]

  • the likelihood, \(\prod^{K}_{k=1}\Pr(t_k|c)\), comes from multiplying together all the probabilities we discussed above. We get a given probability, like \(\Pr(\texttt{dollar}|\texttt{spam})\), by asking how often the word occurs in the spam emails (in the training set), divided by the total number of words we saw in those emails.
  • the prior, \(\Pr(c)\), we can get from anywhere, but the logical place is just the proportion of documents in our training data that are spam or ham respectively. These proportions sum to one.

An example

Let’s work through an example. In our training set, we have 5 emails. We list their words (clearly we have stemmed them or stopped out a bunch of other terms), and then whether we human-coded them as spam or ham.

          email   words                   classification
training  1       money inherit prince    spam
          2       prince inherit amount   spam
          3       inherit plan money      ham
          4       cost amount amazon      ham
          5       prince william news     ham
test      6       prince prince money     ?

Now a new email comes in, and we want to know which category it belongs to. The email consists of three words: prince, prince, money.

Our prior on it being ham is just how common ham was in our training set. So, 3 out of 5, or \(\frac{3}{5}\). We calculate the likelihood as

\[ \Pr(\texttt{prince}|\texttt{ham}) \times \Pr(\texttt{prince}|\texttt{ham}) \times \Pr(\texttt{money}|\texttt{ham}) \]

We get those quantities directly from the training set. Among the ham emails, the probability of seeing prince was \(\frac{1}{9}\): this is literally the fraction of all terms in ham that were prince. The probability of seeing money was \(\frac{1}{9}\) for similar reasons. So, the likelihood is

\[ \frac{1}{9}\times \frac{1}{9}\times \frac{1}{9} = \frac{1}{729} \] Now, we can combine that with the prior to obtain \(\Pr(\texttt{ham}|d)\), or rather, something proportional to it:

\[ \Pr(\texttt{ham}|d) \propto \frac{3}{5} \times \frac{1}{729} = 0.00082. \]

Using the same logic for \(\Pr(\texttt{spam}|d)\), we get:

\[ \Pr(\texttt{spam}|d) \propto \frac{2}{5} \times \frac{2}{6} \times \frac{2}{6} \times \frac{1}{6} . \]

This is

\[ \Pr(\texttt{spam}|d) \propto 0.0074 \]

Clearly, the \(\Pr(\texttt{spam}|d)\) is much larger than \(\Pr(\texttt{ham}|d)\), so our prediction given the data is that the email is spam. Another way to put this is that spam is our maximum a posteriori class or \(c_{\text{map}}\). Notice that we didn’t need to calculate the exact probability of spam or ham: we just had to work out which probability was larger.
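The hand calculation above is easy to reproduce directly in R (a sketch; the fractions come straight from the training-set table):

```r
# Priors: class proportions in the training set.
prior_ham  <- 3/5
prior_spam <- 2/5

# Likelihoods for the test email "prince prince money".
lik_ham  <- (1/9) * (1/9) * (1/9)  # prince, prince, money given ham
lik_spam <- (2/6) * (2/6) * (1/6)  # prince, prince, money given spam

prior_ham  * lik_ham   # ≈ 0.00082
prior_spam * lik_spam  # ≈ 0.0074
```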

Using R for naive Bayes

We can perform the same analysis much more quickly using quanteda. First, we create the relevant corpus:

require(quanteda)
require(quanteda.textmodels)
## Loading required package: quanteda.textmodels
email1 <- c("money inherit prince") 
email2 <- c("prince inherit amount")
email3 <- c("inherit plan money")
email4 <- c("cost amount amazon")
email5 <- c("prince william news")
email6 <- c("prince prince money")
the_corpus <- corpus(c(email1, email2, email3, email4, email5, email6))
docvars(the_corpus, "category") <- c("spam", "spam", "ham", "ham", "ham", NA) 

# make into a dfm
dfm_corp <- dfm(tokens(the_corpus))

# Split into training and test sets
dfm_train <- dfm_corp[1:5, ]
dfm_test  <- dfm_corp[6, ]

# Fit Naive Bayes model
nb_model <- textmodel_nb(dfm_train, docvars(the_corpus, "category")[1:5])

# Predict category of email6
prediction <- predict(nb_model, newdata = dfm_test)

# View result
print(prediction)
## text6 
##  spam 
## Levels: ham spam

We can double-check our working above. (Note that textmodel_nb applies add-one smoothing by default, so the probabilities differ a little from our hand calculation, though the predicted class is the same.)

# Predict class probabilities for email6
predict(nb_model, newdata = dfm_test, type = "prob")
##        
## docs          ham      spam
##   text6 0.2045827 0.7954173

Aside: Derivation of Bayes Rule

If you’re interested, here is how Bayes Rule is derived. We said that

\[ \Pr(A|B) = \frac{\Pr(A,B)}{\Pr(B)} \]

Let’s write the same expression but in terms of \(B|A\). Then:

\[ \Pr(B|A) = \frac{\Pr(B,A)}{\Pr(A)} \]

But the probability of \(A\) and \(B\) occurring is the same as the probability of \(B\) and \(A\) occurring. That is, \(\Pr(A,B)=\Pr(B,A)\). From the second equation, we also know that

\[ \Pr(B,A)={\Pr(A)}\times \Pr(B|A) \] But if this quantity is also equal to \(\Pr(A,B)\), it must be that

\[ \Pr(A|B) = \frac{\Pr(A)\Pr(B|A)}{\Pr(B)} \]